Add a Dynamo deepdive to documentation (#122305)

This supersedes the previous "Guards Overview" as a more comprehensive
approach to most of the main topics within Dynamo.

In the future, we could add specific sections for each of the topics
discussed here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122305
Approved by: https://github.com/msaroufim
diff --git a/docs/source/torch.compiler.rst b/docs/source/torch.compiler.rst
index 8a498c7..69dcd70 100644
--- a/docs/source/torch.compiler.rst
+++ b/docs/source/torch.compiler.rst
@@ -103,7 +103,7 @@
    :maxdepth: 1
 
    torch.compiler_deepdive
-   torch.compiler_guards_overview
+   torch.compiler_dynamo_deepdive
    torch.compiler_dynamic_shapes
    torch.compiler_nn_module
    torch.compiler_best_practices_for_backends
diff --git a/docs/source/torch.compiler_dynamo_deepdive.rst b/docs/source/torch.compiler_dynamo_deepdive.rst
new file mode 100644
index 0000000..79af3da
--- /dev/null
+++ b/docs/source/torch.compiler_dynamo_deepdive.rst
@@ -0,0 +1,876 @@
+Dynamo Deep-Dive
+================
+
+TorchDynamo (or simply Dynamo) is the tracer within ``torch.compile``,
+and it is, more often than not, the one to blame for those insane
+backtraces. However, we cannot blindly blame Dynamo for these errors. In
+order to provide the user with the flexibility it does, Dynamo is given
+the arduous task of understanding any Python program. In particular,
+Dynamo has to implement a good part of the Python programming language
+internally!
+
+In this post, we will go over the internal design of Dynamo from the
+ground up. We will discuss the functionality it provides, and how it is
+implemented. By the end of this post, you will have a better
+understanding of what went wrong when you ``torch.compiled`` a PyTorch
+program and the compilation errored out, or succeeded but the speed-up
+was not what you expected. [1]_
+
+A Gentle Introduction to Dynamo
+-------------------------------
+
+Before getting our hands dirty with all the implementation details,
+let’s start by discussing what it is that Dynamo does.
+
+Dynamo is a tracer. This means that, given a function and inputs to it, it
+executes the function and records a linear sequence of instructions
+(without control flow) into a graph. For example, consider the following
+program:
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def mse(x, y):
+       z = (x - y) ** 2
+       return z.sum()
+
+   x = torch.randn(200)
+   y = torch.randn(200)
+   mse(x, y)
+
+If we save this program into the file ``example.py`` and we run
+
+.. code:: bash
+
+   TORCH_LOGS=graph_code python example.py
+
+we see the output that Dynamo traced
+
+.. code:: python
+
+   def forward(l_x_: torch.Tensor, l_y_: torch.Tensor):
+       # File: example.py:5, code: z = (x - y) ** 2
+       sub = l_x_ - l_y_
+       z = sub ** 2
+       # File: example.py:6, code: return z.sum()
+       sum_1 = z.sum()
+       return (sum_1,)
+
+We call this a **graph (or trace) of the function for the given
+inputs**. This is represented via an `FX
+graph <https://pytorch.org/docs/stable/fx.html>`__. We will simply think
+of an FX graph as a container that stores a list of function calls.
+
+The first thing we should notice is that the graph is a linear sequence
+of PyTorch operations. [2]_ Dynamo records all the PyTorch operations
+and stores them sequentially. For example, it split ``z = (x - y) ** 2``
+into its two constituent operations, ``sub = l_x_ - l_y_`` and
+``z = sub ** 2``.
+
+When we say that the trace is linear, we mean that there is no branching
+or any control flow. To see this, consider
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def fn(x, n):
+       y = x ** 2
+       if n >= 0:
+           return (n + 1) * y
+       else:
+           return y / n
+
+   x = torch.randn(200)
+   fn(x, 2)
+
+which, when executed with ``TORCH_LOGS=graph_code``, returns
+
+.. code:: python
+
+   def forward(l_x_: torch.Tensor):
+       # File: example.py:5, code: y = x ** 2
+       y = l_x_ ** 2
+       # File: example.py:7, code: return (n + 1) * y
+       mul = 3 * y
+       return (mul,)
+
+We see that Dynamo completely removed the ``if`` statement from the
+trace and just recorded the operations that were executed with the
+inputs.
+
+As such, it should be clear that **the trace of a function depends on
+the inputs**. In particular, this means that the trace is not generated
+when we write ``@torch.compile``, but when we execute the function
+``fn(x, 2)`` with the actual arguments.
+
+The other interesting thing to note here is that Dynamo removed the
+second argument to the function. Instead, it treated it as a constant
+and recorded the result of the operation ``n + 1`` in the graph. This is
+another feature of Dynamo: Dynamo will treat as constant any non-tensor
+value… other than ints. Let’s now see how ints are special.
+
+The last defining property of Dynamo is that it knows how to handle
+dynamic shapes. Symbolic shapes refer to Dynamo’s ability to trace
+shapes, and more generally, integers, rather than leaving them as
+constants. This avoids recompilations and allows deploying generic
+models that work for any size in production. The main examples of places
+where dynamic shapes appear are the batch size, where we might train a
+model with a fixed batch size but then perform inference for an
+arbitrary batch size, or the variable sequence length that one
+encounters when processing text or audio.
+
+We can see this by executing the example above a few more times
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def fn(x, n):
+       y = x ** 2
+       if n >= 0:
+           return (n + 1) * y
+       else:
+           return y / n
+
+   x = torch.randn(200)
+   fn(x, 2)
+   fn(x, 3)
+   fn(x, -2)
+
+In this case, ``TORCH_LOGS=graph_code`` generates two more graphs
+
+.. code:: python
+
+   # Graph for n==2 omitted
+
+   def forward(self, l_x_: torch.Tensor, l_n_: torch.SymInt):
+       # File: a.py:5, code: y = x ** 2
+       y = l_x_ ** 2
+
+       # File: a.py:7, code: return (n + 1) * y
+       add = l_n_ + 1
+       mul = add * y
+       return (mul,)
+
+.. code:: python
+
+   def forward(self, l_x_: torch.Tensor, l_n_: torch.SymInt):
+       # File: a.py:5, code: y = x ** 2
+       y = l_x_ ** 2
+
+       # File: a.py:9, code: return y / n
+       truediv = y / l_n_
+       return (truediv,)
+
+Dynamo detected that one integer changed its value after the first call
+and started tracing it. We see that these graphs are generic, and trace
+the variable ``n`` symbolically via an object of type ``SymInt``.
+
+If after these calls we call ``fn(x, 4)``, Dynamo would not recompile,
+but rather reuse the graph that was already traced.
+
+To summarize:
+
+1. Dynamo is a Python tracer
+2. Given some inputs, it returns an FX graph with the PyTorch functions
+   that were executed
+3. It can also trace integers if it detects that they changed between
+   calls
+4. It specializes any other value that is not a tensor or a scalar
+
+Of course, Dynamo does many more things, like figuring out when it needs
+to retrace, rewriting the bytecode of the function, implementing graph
+breaks… To keep the introduction short, we will incrementally discuss
+all these in the sequel.
+
+PEP 523: Adding a frame evaluation API to CPython
+-------------------------------------------------
+
+Imagine now that we are given the task to implement Dynamo. Where would
+we even start? Rather conveniently for us, `PEP
+523 <https://peps.python.org/pep-0523/>`__ was released with Python 3.6.
+This PEP `was
+designed <https://peps.python.org/pep-0523/#a-jit-for-cpython>`__ to
+allow third parties to create JIT compilers for Python. Let’s see how.
+
+**A note on CPython**: CPython is internally implemented as a `stack
+machine <https://en.wikipedia.org/wiki/Stack_machine>`__. A Python
+program is compiled into
+`bytecodes <https://en.wikipedia.org/wiki/Bytecode>`__ that then are
+executed by this interpreter. To learn more about these bytecodes, see
+the `dis module <https://docs.python.org/3/library/dis.html>`__ from the
+standard library. See also `the developer
+docs <https://devguide.python.org/internals/interpreter/>`__ for an
+introduction to CPython’s interpreter. We will assume that the reader is
+familiar with the notion of a stack machine.
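+
+For instance, we can print the bytecode that CPython generates for the
+``mse`` function from the first example using the ``dis`` module:
+
+.. code:: python
+
+   import dis
+
+   def mse(x, y):
+       z = (x - y) ** 2
+       return z.sum()
+
+   # Prints the stack-machine bytecode that CPython executes for mse
+   dis.dis(mse)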
+
+PEP 523 exposes an API where a user can add a custom per-function
+interpreter. Then, CPython will use this interpreter rather than its own
+to execute the function. In order to be able to execute the function, on
+entry, CPython provides the custom interpreter with things like:
+
+- The bytecode of the function
+- The value of the arguments of the function (i.e., the local variables)
+  and their names
+- The value of the global variables and their names
+- The builtin functions like ``abs`` or ``print``
+
+You can see all the fields
+`here <https://github.com/pytorch/pytorch/blob/e891a3bba9f05697d72776f6e89347231a141f03/torch/csrc/dynamo/eval_frame.c#L50-L59>`__. [3]_
+
+In summary, CPython provides the user’s interpreter with all the
+information necessary to execute the function. [4]_
+
+With this API, we can implement a tracer by implementing an interpreter
+that runs the code and records in a graph all the PyTorch operations
+that occur during this execution. This is exactly what Dynamo does.
+
+Dynamo uses this CPython API to parse all these objects and pack them
+into `a Python
+structure <https://github.com/pytorch/pytorch/blob/e891a3bba9f05697d72776f6e89347231a141f03/torch/csrc/dynamo/eval_frame.c#L93-L108>`__.
+After it has done so… it goes back from C to Python. Other than for this
+piece of code that communicates with CPython, Dynamo is fully
+implemented in Python.
+
+It should be clear that it is the decorator ``@torch.compile``\ ’s job
+to install the necessary scaffolding that will pass the bytecode, the
+args, global variables and so on to Dynamo when the function is called.
+Again, ``@torch.compile`` does not actually compile anything.
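+
+We can convince ourselves of this with a minimal example: decorating a
+function does not trace or compile anything; Dynamo only runs when the
+function is called with actual arguments.
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def fn(x):
+       return x + 1
+
+   # Nothing has been traced or compiled yet; the decorator just wrapped fn.
+   # Tracing and compilation happen on this first call.
+   out = fn(torch.randn(3))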
+
+Implementing CPython in Python
+------------------------------
+
+So, we are back in the Python world. We have the bytecode of a function,
+and all the context necessary to execute it. In particular, we have
+landed at
+```_convert_frame_assert`` <https://github.com/pytorch/pytorch/blob/b6df8414601e1e086e830ca9e919e7fdc8874e71/torch/_dynamo/convert_frame.py#L272-L274>`__.
+This is the function that the decorator ``torch.compile`` returns! We
+get to this function from
+```_dynamo.optimize`` <https://github.com/pytorch/pytorch/blob/b6df8414601e1e086e830ca9e919e7fdc8874e71/torch/_dynamo/eval_frame.py#L715-L727>`__.
+The decorator ``torch.compile`` is just a nice API around
+``_dynamo.optimize``.
+
+Before getting into implementing a Python interpreter, we want to define
+an `IR <https://en.wikipedia.org/wiki/Intermediate_representation>`__.
+In particular, we want to wrap all the local and global variables in our
+own internal classes. This allows us to better track these objects and
+group together objects that can be treated in the same way in the eyes
+of Dynamo.
+
+The base class of this internal class hierarchy is ``VariableTracker``,
+which represents the different objects that Dynamo understands. For
+example, ``ListVariable`` represents a ``list`` object, and keeps
+internally a `list of
+``VariableTracker``\ s <https://github.com/pytorch/pytorch/blob/e38a3a6079a3861b4bc9f256120ec661f34e726d/torch/_dynamo/variables/lists.py#L48-L56>`__.
+Another example of ``VariableTracker`` is
+`ConstantVariable <https://github.com/pytorch/pytorch/blob/83c0763dda1f93c6cf552ba88260a0dc7a3ecb70/torch/_dynamo/variables/constant.py#L30>`__.
+ConstantVariable wraps all the `objects considered constant by
+Dynamo <https://github.com/pytorch/pytorch/blob/83c0763dda1f93c6cf552ba88260a0dc7a3ecb70/torch/_dynamo/variables/constant.py#L98-L107>`__.
+We also have special subclasses for objects that require special
+attention, like
+`TensorVariable <https://github.com/pytorch/pytorch/blob/83c0763dda1f93c6cf552ba88260a0dc7a3ecb70/torch/_dynamo/variables/tensor.py#L68-L69>`__.
+All these internal classes are defined in the
+```torch/_dynamo/variables`` <https://github.com/pytorch/pytorch/tree/83c0763dda1f93c6cf552ba88260a0dc7a3ecb70/torch/_dynamo/variables>`__
+folder.
+
+Python objects are wrapped into their corresponding ``VariableTracker``
+class in
+```VariableBuilder._wrap`` <https://github.com/pytorch/pytorch/blob/83c0763dda1f93c6cf552ba88260a0dc7a3ecb70/torch/_dynamo/variables/builder.py#L365>`__.
+This function is just a very long chain of ``elif``\ s that tries to
+recursively pattern-match the Python inputs into the appropriate type of
+``VariableTracker``.
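+
+As a mental model, here is a tiny, self-contained sketch of this kind of
+recursive pattern matching. It is not Dynamo’s actual code; it just
+returns the name of the ``VariableTracker`` subclass that would roughly
+be chosen for a given Python object.
+
+.. code:: python
+
+   import torch
+
+   def toy_wrap(value):
+       # A drastically simplified stand-in for VariableBuilder._wrap
+       if isinstance(value, torch.Tensor):
+           return "TensorVariable"
+       elif isinstance(value, (int, float, str, bool, type(None))):
+           return "ConstantVariable"
+       elif isinstance(value, list):
+           # Lists are wrapped recursively, element by element
+           return f"ListVariable({[toy_wrap(v) for v in value]})"
+       else:
+           return "UserDefinedObjectVariable"  # the catch-all
+
+   print(toy_wrap([torch.randn(2), 3]))
+   # ListVariable(['TensorVariable', 'ConstantVariable'])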
+
+**Debugging tip**. When we get unexpected results from Dynamo, it is
+sometimes caused by the builder. If the logic of the builder is wrong,
+sometimes Dynamo may wrap a variable in the incorrect
+``VariableTracker`` type, and this may cause issues later on. It is
+rather useful to have a look at the ``VariableTracker`` types that
+appear in the errors, and the ``VariableTracker`` method that throws the
+exception when you encounter a Dynamo error. In particular, sometimes we
+find that an object is tracked as a ``UserDefinedObjectVariable`` (this
+is Dynamo’s catch-all class), when it should have been tracked as
+something more specific. In these cases, the ``VariableBuilder.__call__``
+logic is often to blame.
+
+**Debugging tip**. When running a program with ``TORCH_LOGS=dynamo``,
+one of the artifacts that are printed out is lines of the form
+
+::
+
+   TRACE LOAD_GLOBAL y [TorchInGraphFunctionVariable(<built-in method any>), TensorVariable()]
+
+This is the bytecode for the original program and the state of the stack
+at that point. This is very useful to find where an object was not
+traced into the right ``VariableTracker``.
+
+Ok, so we have an IR for our tracer, now we *just* need to reimplement
+CPython’s stack machine. This is implemented by
+```InstructionTranslatorBase`` <https://github.com/pytorch/pytorch/blob/69f112d5867f785a3a090a0c6d6644ae047033ac/torch/_dynamo/symbolic_convert.py#L576-L594>`__
+in
+```symbolic_convert.py`` <https://github.com/pytorch/pytorch/blob/69f112d5867f785a3a090a0c6d6644ae047033ac/torch/_dynamo/symbolic_convert.py>`__.
+
+``InstructionTranslatorBase`` has about 200 methods, implementing almost
+all of the Python bytecodes. As an example, we can see the implementation of
+``BUILD_LIST``
+
+.. code:: python
+
+   def BUILD_LIST(self, inst):
+       items = self.popn(inst.argval)
+       self.push(ListVariable(items, mutable_local=MutableLocal()))
+
+This is the bytecode generated by constructions like ``l = [2, 3, 4]``.
+In this case, since there are three elements, the generated bytecode is
+``BUILD_LIST 3``. This means that we pop the top ``3`` elements of the
+stack and push a new list object formed by these three elements onto the
+top of the stack.
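+
+To get a feel for this dispatch pattern, here is a self-contained toy
+stack machine (not Dynamo’s actual classes) that processes a
+hand-written instruction stream the same way ``InstructionTranslatorBase``
+steps through real bytecode:
+
+.. code:: python
+
+   from collections import namedtuple
+
+   # A hand-rolled stand-in for dis.Instruction
+   Instruction = namedtuple("Instruction", ["opname", "argval"])
+
+   class MiniStackMachine:
+       def __init__(self):
+           self.stack = []
+
+       def push(self, value):
+           self.stack.append(value)
+
+       def popn(self, n):
+           items = self.stack[-n:]
+           del self.stack[-n:]
+           return items
+
+       # One method per bytecode, looked up by name like Dynamo does
+       def LOAD_CONST(self, inst):
+           self.push(inst.argval)
+
+       def BUILD_LIST(self, inst):
+           self.push(self.popn(inst.argval))
+
+       def step(self, inst):
+           if not hasattr(self, inst.opname):
+               raise NotImplementedError(f"missing: {inst.opname}")
+           getattr(self, inst.opname)(inst)
+
+   # Roughly the instructions generated for `l = [2, 3, 4]`
+   program = [
+       Instruction("LOAD_CONST", 2),
+       Instruction("LOAD_CONST", 3),
+       Instruction("LOAD_CONST", 4),
+       Instruction("BUILD_LIST", 3),
+   ]
+
+   vm = MiniStackMachine()
+   for inst in program:
+       vm.step(inst)
+   print(vm.stack)  # [[2, 3, 4]]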
+
+Generating the Output Graph
+---------------------------
+
+With a way to symbolically execute Python code, we are set to extract
+the PyTorch operations that happen during the symbolic execution of a
+program given some inputs. This is implemented in Dynamo via the
+```OutputGraph`` <https://github.com/pytorch/pytorch/blob/69f112d5867f785a3a090a0c6d6644ae047033ac/torch/_dynamo/output_graph.py#L221-L230>`__
+object. The ``OutputGraph`` object is `bound to an
+``InstructionTranslator``
+object <https://github.com/pytorch/pytorch/blob/69f112d5867f785a3a090a0c6d6644ae047033ac/torch/_dynamo/symbolic_convert.py#L2060-L2071>`__
+and it tracks all the data necessary to create the FX graph which will
+be returned by Dynamo.
+
+All the inputs and intermediary elements of the FX graph are
+``fx.Node``\ s. In Dynamo, ``fx.Node``\ s are wrapped in
+``fx.Proxy``\ s. ``fx.Proxy``\ s are used to build the FX graph.
+In particular, they record every PyTorch operation performed on them
+into the graph. You can create a new operation to be added to
+the graph by calling ```create_proxy`` <https://github.com/pytorch/pytorch/blob/fb80f05ee2e1cba17892980701bfd5dbce58349f/torch/_dynamo/output_graph.py#L430-L431>`__.
+Then, we can add it to the graph through the function
+```wrap_fx_proxy`` <https://github.com/pytorch/pytorch/blob/fb80f05ee2e1cba17892980701bfd5dbce58349f/torch/_dynamo/variables/builder.py#L1311>`__.
+
+A graph stores operations on tensors… and operations on symbolic
+integers. We will discuss symbolic integers later on, but first we will
+discuss how Dynamo addresses a rather important correctness issue.
+
+.. _making-dynamo-sound-guards:
+
+Making Dynamo Sound: Guards
+---------------------------
+
+At this point, we have a way to trace programs completely disregarding control flow.
+And for that, we have reimplemented all of CPython… If this sounds like a bit of
+overkill, that is because it is.
+```torch.jit.trace`` <https://pytorch.org/docs/stable/generated/torch.jit.trace.html>`__
+already implements this without all this machinery, so what gives?
+
+The issue with ``torch.jit.trace``, as its docs warn, is that it only
+works if the traced program is not data dependent. In other words, it
+only works if the program itself is linear. This means writing our
+program without using if-else statements, for or while loops, or
+exceptions. Even more restrictive: none of the libraries that we use can
+use any control flow either! All in all, not using control flow in a
+language as dynamic as Python is, in fact, a huge constraint.
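+
+As a minimal illustration of this pitfall, ``torch.jit.trace`` silently
+bakes in the branch that was taken with the example inputs (it typically
+emits a ``TracerWarning`` while doing so):
+
+.. code:: python
+
+   import torch
+
+   def fn(x):
+       if x.sum() > 0:  # data-dependent control flow
+           return x + 1
+       else:
+           return x - 1
+
+   # The positive example input selects the `x + 1` branch, and that branch
+   # is the only thing recorded in the trace.
+   traced = torch.jit.trace(fn, torch.ones(3))
+
+   print(traced(-torch.ones(3)))
+   # tensor([0., 0., 0.]) instead of tensor([-2., -2., -2.])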
+
+JAX solves this problem by always retracing and caching the graph after
+retracing. Dynamo, on the other hand, uses guards to avoid retracing the
+whole program every time.
+
+A **guard** is an assumption (a boolean expression on an input) made in
+order to specialize a frame for one set of example inputs. Reusing the
+graph is only valid if these assumptions hold on the new inputs.
+
+For example, any constant input to a function, like a string, installs a
+guard stating that that input should be of type ``str`` and equal to the
+string we passed. Running
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def fn(a, b):
+       return a * len(b)
+
+   fn(torch.arange(10), "Hello")
+
+with ``TORCH_LOGS=guards`` prints (among other guards)
+
+.. code:: python
+
+   ___check_type_id(L['b'], 94334122025024)
+   L['b'] == 'Hello'
+
+This reads as “the local variable ``b`` should have a specific type
+(``str`` in this case, represented by the constant ``9433...``) and
+its value should be ``'Hello'``”. If we then execute the function
+again passing a different argument
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def fn(a, b):
+       return a * len(b)
+
+   fn(torch.arange(10), "Hello")
+   fn(torch.arange(10), "Hi")
+
+we can see the guard that failed by running ``TORCH_LOGS=recompiles``
+
+.. code:: python
+
+   Recompiling function fn in script.py:3
+   triggered by the following guard failure(s):
+        - L['b'] == 'Hello'
+
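+Conceptually, ``___check_type_id`` boils down to an identity check on
+the type of the object; the long integer in the printed guard is the
+``id`` of the expected type in that particular process. A rough Python
+rendition of what it verifies:
+
+.. code:: python
+
+   def check_type_id(obj, expected_type_id):
+       # The constant recorded at trace time (e.g. 94334122025024 above)
+       # is id(str) in that run of the program.
+       return id(type(obj)) == expected_type_id
+
+   print(check_type_id("Hello", id(str)))   # True
+   print(check_type_id(b"Hello", id(str)))  # False
+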
+Guards are accumulated while `the inputs to the function are wrapped in
+the
+builder <https://github.com/pytorch/pytorch/blob/69f112d5867f785a3a090a0c6d6644ae047033ac/torch/_dynamo/variables/builder.py#L808-L810>`__
+and `during the execution of the
+program <https://github.com/pytorch/pytorch/blob/69f112d5867f785a3a090a0c6d6644ae047033ac/torch/_dynamo/variables/dicts.py#L763-L769>`__.
+We will show many more examples of guards in the next section, but first
+let us discuss sources.
+
+A **source** tracks how to reconstruct a variable from the original
+local or global variables present when entering the current frame. In
+particular, it tracks the original local and global objects and any of
+the objects they contain. In
+
+.. code:: python
+
+   def foo(x: Tensor, y: List[Tensor]):
+       a = x * y[0]
+       return a * x
+
+``x`` and ``y`` have
+```LocalSource`` <https://github.com/pytorch/pytorch/blob/40dc0580a69565b06ec5263efe5d87cecc8200f7/torch/_dynamo/source.py#L80-L92>`__
+as their source, and ``y[0]`` has
+```GetItemSource`` <https://github.com/pytorch/pytorch/blob/40dc0580a69565b06ec5263efe5d87cecc8200f7/torch/_dynamo/source.py#L302>`__,
+which stores a ``LocalSource`` inside. On the other hand, ``a`` will not
+have a source as it is an intermediate variable that only exists within
+the FX graph.
+
+All these are defined in
+```torch/_dynamo/source.py`` <https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/source.py>`__.
+We can see the guard generated by ``GetItemSource`` in the following
+example:
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def fn(x, l):
+       return x * len(l[0])
+
+   fn(torch.randn(8), ["Hi", "Hello"])
+
+generates the following guards
+
+.. code:: python
+
+   ___check_type_id(L['l'], 94439025877664)
+   len(L['l']) == 2
+   ___check_type_id(L['l'][0], 94439025840192)
+   L['l'][0] == 'Hi'
+   ___check_type_id(L['l'][1], 94439025840192)
+   L['l'][1] == 'Hello'
+
+Here, we see the code generated by ``GetItemSource`` (``[0]`` and
+``[1]``) wrapping a ``LocalSource`` (``L['l']``).
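+
+As a mental model, the source for ``l[0]`` composes as in the following
+toy sketch (these are not Dynamo’s actual classes), which is how the
+``L['l'][0]`` expressions in the guards above are produced:
+
+.. code:: python
+
+   class LocalSource:
+       def __init__(self, name):
+           self.name = name
+
+       def expr(self):
+           # How to fetch the object from the frame's local variables
+           return f"L['{self.name}']"
+
+   class GetItemSource:
+       def __init__(self, base, index):
+           self.base = base
+           self.index = index
+
+       def expr(self):
+           # Reconstruct the container first, then index into it
+           return f"{self.base.expr()}[{self.index!r}]"
+
+   print(GetItemSource(LocalSource("l"), 0).expr())  # L['l'][0]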
+
+At this point, with sources and guards, we are able to implement a
+caching system to avoid recompilation without having to retrace every
+time. We will discuss this caching system in a bit more detail later
+on.
+
+The attentive reader will have noticed that this does not yet explain
+why we need such fine-grained control over the Python interpreter that
+we have to reimplement it. The examples of guards that we have shown
+depend only on the input objects, so we could still compute these before
+executing the function. In other words, we could implement this guard
+system on top of ``torch.jit.trace`` and get the same functionality with
+much less effort… Enter symbolic shapes.
+
+Symbolic Shapes
+---------------
+
+Another point we discussed in the introduction is that Dynamo knows how
+to trace integers. In order to implement this, we use a symbolic class
+```torch.SymInt`` <https://github.com/pytorch/pytorch/blob/fb80f05ee2e1cba17892980701bfd5dbce58349f/torch/__init__.py#L244-L249>`__\  [5]_
+that acts like an ``int`` but records all the operations performed on
+it in the output FX graph. We already saw this class in the introduction
+when introducing symbolic integer tracing.
+
+Let us now discuss the three properties that define symbolic shape
+tracing in Dynamo, and how to implement them.
+
+Static by default
+^^^^^^^^^^^^^^^^^
+
+Dynamo assumes that every integer, be it an input or the shape of
+a tensor, is static by default. In other words, no integers will be
+traced on the first execution of a function. Only if it detects
+that an integer or a shape changed value between calls will Dynamo
+trace it and generate a graph that is generic over that variable.
+
+We already saw this behavior in the introduction using integers. Let us
+now look at an example using shapes of tensors.
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def fn(a, b):
+       return a.shape[0] * a * b
+
+   fn(torch.randn(4, 3), torch.randn(4, 3))
+   fn(torch.randn(8, 3), torch.randn(8, 3))
+
+Running this program with ``TORCH_LOGS=graph_code`` we see that these
+two calls are traced as
+
+.. code:: python
+
+   def forward(self, l_a_: torch.Tensor, l_b_: torch.Tensor):
+       mul = 4 * l_a_
+       mul_1 = mul * l_b_
+       return (mul_1,)
+
+   def forward(self, s0: torch.SymInt, l_a_: torch.Tensor, l_b_: torch.Tensor):
+       size = l_a_.size()
+       getitem = size[0]
+       mul = getitem * l_a_
+       mul_1 = mul * l_b_
+       return (mul_1,)
+
+In the first graph the shape is traced as a constant, but once it
+changes, it is traced symbolically using a ``SymInt``. In general, a
+simpler way to see the shapes of the intermediary values is by running
+the program with ``TORCH_LOGS=graph_sizes``
+
+::
+
+   TRACED GRAPH TENSOR SIZES
+   ===== __compiled_fn_1 =====
+   l_a_: (s0, 3)
+   l_a_ (concrete): (8, 3)
+   l_b_: (s0, 3)
+   l_b_ (concrete): (8, 3)
+   mul: (s0, 3)
+   mul (concrete): (8, 3)
+   mul_1: (s0, 3)
+   mul_1 (concrete): (8, 3)
+
+where we can see that the first dimension of the two tensor args is
+dynamic, given that it is represented by the ``s0`` variable.
+
+We can find how Dynamo implements this by running ``TORCH_LOGS=guards``
+
+.. code:: python
+
+   # Guards first call
+   check_tensor(L['a'], torch.float32, device=None, requires_grad=False, size=[4, 3], stride=[3, 1])
+   check_tensor(L['b'], torch.float32, device=None, requires_grad=False, size=[4, 3], stride=[3, 1])
+
+   # Guards second call
+   check_tensor(L['a'], torch.float32, device=None, requires_grad=False, size=[None, 3], stride=[3, 1])
+   check_tensor(L['b'], torch.float32, device=None, requires_grad=False, size=[None, 3], stride=[3, 1])
+
+   L['b'].size()[0] == L['a'].size()[0]
+   2 <= L['a'].size()[0]
+
+We see that on the first call, the guards check that the tensors have
+some fixed sizes and strides. These guards fail in the second execution,
+so it retraces. Since it was an ``int`` guard that failed, in this
+second iteration it traces this ``int`` symbolically and it installs
+more general guards on this more generic kernel.
+
+**Compilation performance tip**. If you know that a dimension will vary
+in size, you can mark it as dynamic by calling
+```torch._dynamo.mark_dynamic`` <https://github.com/pytorch/pytorch/blob/66a76516bfc341b2b55bb2056d2faa9c2de46d69/torch/_dynamo/decorators.py#L176>`__
+before calling ``torch.compile``. This will avoid the first compilation
+with a static shape. There are other useful utility functions like
+``maybe_mark_dynamic`` or ``mark_static``. You can also have all
+integers and shapes traced by calling ``torch.compile(dynamic=True)``.
+This is mostly useful for debugging purposes.
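+
+For instance, for the example above we could have skipped the initial
+static compilation along dimension 0 with something like the following
+sketch (``mark_dynamic`` takes the tensor and the dimension to mark):
+
+.. code:: python
+
+   import torch
+   import torch._dynamo
+
+   def fn(a, b):
+       return a.shape[0] * a * b
+
+   a = torch.randn(8, 3)
+   b = torch.randn(8, 3)
+   # Mark dimension 0 of both inputs as dynamic before the first call, so
+   # the very first trace already uses a SymInt for that dimension.
+   torch._dynamo.mark_dynamic(a, 0)
+   torch._dynamo.mark_dynamic(b, 0)
+   torch.compile(fn)(a, b)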
+
+0, 1 are always specialized
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Regardless of whether we have marked a dimension as dynamic or Dynamo
+has already traced an integer as dynamic, if we pass an input where that
+dimension is 0 or 1, Dynamo will trace it as non-dynamic and generate a
+specific graph for it. This is why in the example above we find guards
+of the form ``2 <= L['a'].size()[0]``.
+
+There are several reasons for this choice. Two particularly important
+ones are:
+
+- A tensor is empty if and only if any of its dimensions is zero
+- A tensor can only be contiguous if one of its strides is one
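+
+Going back to the previous example, after the dynamic recompilation a
+batch of size 1 still fails the ``2 <= L['a'].size()[0]`` guard, so it
+triggers one more, specialized, compilation. Running with
+``TORCH_LOGS=recompiles`` should show that guard failing:
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def fn(a, b):
+       return a.shape[0] * a * b
+
+   fn(torch.randn(4, 3), torch.randn(4, 3))  # static graph, specialized on 4
+   fn(torch.randn(8, 3), torch.randn(8, 3))  # recompiles with a dynamic s0
+   fn(torch.randn(1, 3), torch.randn(1, 3))  # recompiles again: 1 is specialized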
+
+Duck shaping
+^^^^^^^^^^^^
+
+Dynamo performs what we call “duck shaping”. If two dynamic integers
+have the same value at trace time, we will assume that they are equal
+and guard on it. Effectively, this means that rather than having two
+symbols ``s0``, ``s1`` in the example above, we just unified them to
+``s0`` and had the guard ``L['b'].size()[0] == L['a'].size()[0]``. This
+enables performing fusions within the compiler while being able to
+generate kernels that are generic enough.
+
+Guards on symbolic ints
+^^^^^^^^^^^^^^^^^^^^^^^
+
+We now understand how symbolic shapes are implemented at a high level
+and the properties they have. Now, why is it that symbolic shapes forced
+us down the tricky route of taking control of the CPython interpreter?
+Consider the following example:
+
+.. code:: python
+
+   import torch
+
+   @torch.compile(dynamic=True)
+   def fn(a):
+       if a.shape[0] * 2 < 16:
+           return a
+       else:
+           return a + 1
+
+   fn(torch.randn(8))
+
+This code has a guard of the form ``2*L['a'].size()[0] >= 16``. This is
+a non-trivial guard in terms of the inputs of the function, but it is
+registered in the middle of the execution of the program. What is more,
+we cannot know this guard is needed until we hit the ``if`` statement,
+whose condition depends on a ``SymNodeVariable``. Such conditions are
+invisible to ``torch.jit.trace`` and require a deep analysis of the
+Python code.
+
+**Debugging tip**. Running this code with ``TORCH_LOGS=dynamo`` tells us
+where this guard was added
+
+::
+
+   eval 2*s0 >= 16 [guard added] at script.py:5 in fn (_dynamo/variables/tensor.py:812 in evaluate_expr)
+
+Placing a breakpoint there and looking at the backtrace is rather useful
+to understand where a guard came from.
+
+Making Dynamo Complete: Graph Breaks
+------------------------------------
+
+With all the tools we have discussed, we have a tracer that can trace
+PyTorch operations on tensors and integers and has a caching system that
+knows when it can reuse a previously traced graph and when it needs to
+retrace. All this, while executing arbitrary Python code!
+
+There is just one small issue with this. The statement “executing
+arbitrary Python code” is perhaps a bit too general. Dynamo implements a
+good part of Python, but does it implement the more complex parts, like
+coroutines or async? Does it implement the whole Python standard
+library? NumPy also has a Python API. Does ``torch.compile`` also
+understand NumPy? And Django? [6]_
+
+Python’s ecosystem is massive, and a good part of it is written in other
+more performant languages like C++ or Rust, and it just exposes Python
+bindings. There is no hope of Dynamo tracing through Python objects that
+are implemented in C++. What can a tracer do when it finds an operation
+that it does not understand?
+
+The usual way machine learning tracers handle this issue is by informing
+the user of the operation they choked on and giving up tracing
+altogether. This would pose a real usability issue in the case of
+PyTorch, whose users are used to the flexibility it gives them. As a
+real-world example, the ```doctr_det_predictor`` model uses NumPy and the
+``cv2`` library to postprocess the model’s
+result <https://github.com/mindee/doctr/blob/f2114758d529ed8d3d0030581638f0520b6b98d8/doctr/models/detection/core.py#L86>`__.
+
+Here is another place where having access to CPython is interesting.
+Rather than erroring out, Dynamo can let CPython run that problematic
+code! To do this, Dynamo generates at trace time one graph with all the
+operations before the problematic code, and one with all the operations
+after. [7]_ Then, at runtime, it will delegate to CPython to execute the
+first graph, then the problematic code, and then the second graph. This
+process of stopping the tracing and generating multiple graphs is called
+a **graph break**.
+
+A small confession: I lied all throughout the introduction and the first
+sections. Dynamo does not generate one graph, but **multiple graphs**!
+For all practical purposes, restarting the trace after a graph break can
+be thought of as starting to trace a new function. The new graph after
+the graph break will have its own guards, its new set of local
+variables, and so on.
+
+To discuss how to implement graph breaks, we need to first revisit how
+Dynamo interacts with CPython. Via PEP 523, CPython allows a user to
+register their own frame evaluation mechanism. What we had not discussed
+is that CPython also exposes its own frame evaluation for others to use.
+Dynamo leverages this to let the fast CPython interpreter run the
+compiled code. For a function without graph breaks, the whole tracing /
+execution process of a program that calls the function 2 times with the
+same arguments looks like this:
+
+1. In the first call to the function
+
+   1. Dynamo traces the function into an FX graph
+
+      1. The FX graph is compiled by the compiler (Inductor) into
+         efficient low-level code… but that’s a story for another day
+
+   2. It rewrites the bytecode of the function so that it simply calls
+      the compiled function
+   3. It gives CPython this new bytecode and asks it to run it
+      [`here <https://github.com/pytorch/pytorch/blob/e891a3bba9f05697d72776f6e89347231a141f03/torch/csrc/dynamo/eval_frame.c#L1006>`__]
+
+2. In the second call to the function
+
+   1. It checks the guards from the first call against the new arguments
+      [`here <https://github.com/pytorch/pytorch/blob/e891a3bba9f05697d72776f6e89347231a141f03/torch/csrc/dynamo/eval_frame.c#L658>`__].
+      Since they are the same arguments as before, they pass
+   2. It asks CPython to run the bytecode associated to those guards
+      [`here <https://github.com/pytorch/pytorch/blob/e891a3bba9f05697d72776f6e89347231a141f03/torch/csrc/dynamo/eval_frame.c#L972-L975>`__]
+
+This process on its own looks overly complicated. Why generate new
+bytecode and ask CPython to run it rather than simply creating a C++
+binding to the compiled function and executing it? Well, this pattern
+allows us to implement graph breaks! The bytecode generated by a graph
+break has the following structure:
+
+1. Bytecode that executes the first graph
+2. Bytecode that leaves the stack as it would be had CPython executed
+   the first graph. It also replays any modifications to local
+   or global variables that would be visible at this point
+3. The bytecode that made Dynamo graph break
+4. Bytecode that executes the second graph
+
+Let us see this in a simple example
+
+.. code:: python
+
+   import torch
+
+   @torch.compile
+   def fn(a):
+       b = a + 2
+       print("Hi")
+       return b + a
+
+   fn(torch.randn(4))
+
+Running this with ``TORCH_LOGS=bytecode`` shows us the initial bytecode
+and the modified bytecode
+
+.. code:: python
+
+   MODIFIED BYTECODE fn script.py line 3
+    0 LOAD_GLOBAL              1 (__compiled_fn_0)
+    2 LOAD_FAST                0 (a)
+    4 CALL_FUNCTION            1
+    6 STORE_FAST               3 (graph_out_0)
+    8 LOAD_GLOBAL              0 (print)
+   10 LOAD_CONST               2 ('Hi')
+   12 LOAD_FAST                3 (graph_out_0)
+   14 LOAD_CONST               3 (0)
+   16 BINARY_SUBSCR
+   18 STORE_FAST               1 (b)
+
+   20 CALL_FUNCTION            1
+   22 LOAD_GLOBAL              2 (__resume_at_14_1)
+   24 ROT_TWO
+   26 LOAD_FAST                0 (a)
+   28 LOAD_FAST                1 (b)
+   30 CALL_FUNCTION            3
+   32 RETURN_VALUE
+
+   MODIFIED BYTECODE resume_in_fn script.py line 6
+    0 LOAD_GLOBAL              1 (__compiled_fn_2)
+    2 LOAD_FAST                2 (b)
+    4 LOAD_FAST                1 (a)
+    6 CALL_FUNCTION            2
+    8 UNPACK_SEQUENCE          1
+   10 RETURN_VALUE
+
+We can see that the modified bytecode is split into two functions,
+``fn``, the original function, and a function called ``resume_in_fn``.
+This second function is a function created by Dynamo to implement the
+execution of the program starting at the graph break. This is often
+called a `continuation
+function <https://en.wikipedia.org/wiki/Continuation>`__. This
+continuation function simply calls the second compiled function with the
+right arguments. The code for the initial function is rewritten
+implementing the strategy that we described before
+
+-  L0-4. Call the compiled function (``a + 2``).
+-  L6. Store its result in a local variable called ``graph_out_0``.
+   ``graph_out_0`` is a tuple
+-  L8-18. Leave the stack as it would be at the point of the graph break
+-  L20. Execute the code that caused the graph break
+-  L22-32. Call the compiled continuation function (``a + b``)
+
+The code generation of the stack in Dynamo is delegated to
+``VariableTracker`` subclasses. Every ``VariableTracker`` object in
+Dynamo has a ```reconstruct``
+method <https://github.com/pytorch/pytorch/blob/e891a3bba9f05697d72776f6e89347231a141f03/torch/_dynamo/variables/lists.py#L307-L309>`__
+that generates the necessary bytecode to create the Python object it
+represents on the stack.
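+
+As a toy sketch of what reconstruction means (the names below are
+illustrative, not Dynamo’s actual codegen API), reconstructing a list
+variable amounts to emitting bytecode that pushes each element and then
+packs them into a list:
+
+.. code:: python
+
+   def reconstruct_list(element_local_names):
+       # Push every element of the list...
+       instructions = [("LOAD_FAST", name) for name in element_local_names]
+       # ...then pack them back into a list object on the stack.
+       instructions.append(("BUILD_LIST", len(element_local_names)))
+       return instructions
+
+   print(reconstruct_list(["a", "b"]))
+   # [('LOAD_FAST', 'a'), ('LOAD_FAST', 'b'), ('BUILD_LIST', 2)]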
+
+**Debugging tip**. Graph breaks hamper performance, and as such, it is
+best to avoid them. Running a program with ``TORCH_LOGS=graph_breaks``
+is a great way to find out how many graph breaks our program hit. The
+information it returns is in terms of ``VariableTracker`` objects, so
+the debugging tips above are sometimes also helpful to figure out what
+caused that graph break.
+
+Conclusion
+----------
+
+Dynamo is a complex piece of software. Once you sign up to implement a
+CPython interpreter, you know you are in for a ride. That being said, we
+hope that this post helps demystify it a bit.
+
+Dynamo is (mostly) implemented in Python. We left plenty of links to the
+pieces of the code that we discussed. We hope that reading those pieces
+of code and grepping for the places that call them, or putting
+breakpoints on them and looking at the call stack helps understanding
+the rest of the code base.
+
+Of course, the best way to learn how a piece of software works is by
+extending it. In this case, the best way is to have a look at the `open
+Dynamo issues on
+GitHub <https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3Aopen+label%3A%22module%3A+dynamo%22+>`__.
+Many of them require very minor changes in the code, once you find where
+you need to make those changes.
+
+.. [1]
+   In the same way that Dynamo takes its name from
+   `DynamoRIO <https://dynamorio.org/>`__, this post’s name is a
+   small nod to `You Could Have Invented Spectral
+   Sequences <https://www.ams.org/notices/200601/fea-chow.pdf>`__.
+
+.. [2]
+   In the literature, this is called a Directed Acyclic Graph (DAG).
+
+.. [3]
+   All this binding code lives in ``torch/csrc/dynamo/eval_frame.c``.
+
+.. [4]
+   In CPython lingo, the set of all these objects are called `a
+   frame <https://github.com/python/cpython/blob/f26bfe4b25f7e5a4f68fcac26207b7175abad208/Include/internal/pycore_frame.h#L57-L71>`__.
+
+.. [5]
+   There are also ``SymBool`` and ``SymFloat`` classes. The latter one
+   is not used all that much at the time of this writing.
+
+.. [6]
+   Interestingly enough, it does understand NumPy code! Have a look at
+   `this blogpost <https://pytorch.org/blog/compiling-numpy-code/>`__
+   and `the
+   docs <https://pytorch.org/docs/stable/torch.compiler_faq.html#does-numpy-work-with-torch-compile>`__.
+   Now, this is only possible because we reimplemented NumPy using
+   PyTorch. Good luck implementing Django in PyTorch though…
+
+.. [7]
+   Assuming there is just one piece of problematic code. If there are
+   more, Dynamo can split the code into as many graphs as it needs.
diff --git a/docs/source/torch.compiler_guards_overview.rst b/docs/source/torch.compiler_guards_overview.rst
deleted file mode 100644
index 52efeb3..0000000
--- a/docs/source/torch.compiler_guards_overview.rst
+++ /dev/null
@@ -1,512 +0,0 @@
-Guards Overview
-===============
-
-From a UX perspective, TorchDynamo is very easy to use. The user invokes
-``torchdynamo.optimize`` as an annotation:
-
-.. code-block:: python
-
-   @torchdynamo.optimize(my_compiler)
-   def fn_foo(bar):
-
-Where a complete example looks like this:
-
-.. code-block:: python
-
-   from typing import List
-   import torch
-   from torch import _dynamo as torchdynamo
-
-   def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-       print("my_compiler() called with FX graph:")
-       gm.graph.print_tabular()
-       return gm.forward  # return a python callable
-
-   @torchdynamo.optimize(my_compiler)
-   def toy_example(a, b):
-       x = a / (torch.abs(a) + 1)
-       if b.sum() < 0:
-           b = b * -1
-       return x * b
-
-   for _ in range(100):
-       toy_example(torch.randn(10), torch.randn(10))
-
-This allows TorchDynamo to capture the interpreted Python frames, grab
-any and all relevant information, and speed things up wherever it can.
-The speedup comes from a few places, and can be rather dependent on the
-backend (`my_compiler` in the example above) provided, but the one speedup
-that is important in this section is **caching**. Caching itself is not
-a direct speedup but a critical enablement that prevents
-recompilation. We dig a hole with dynamo, and caching allows us to get
-out. It enables us to hold perf
-neutrality while then enabling backends - the true source of our
-speedups.
-
-With even a pass-through no-op backend provided:
-
-.. code-block:: python
-
-   def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-       return gm.forward
-
-We can see TorchDynamo speeding up Python execution even on
-regular Python, not just PyTorch.
-
-Caching and Guards Overview
----------------------------
-
-TorchDynamo operates through caching transformed (by TorchDynamo) user
-bytecode. When TorchDynamo receives a frame for evaluation, it checks if the
-**objects referenced in the frame have changed** in certain ways, and if
-not, TorchDynamo reads the previously transformed user bytecode to evaluate it.
-In this section, we will focus on how we can identify whether or not the
-**objects referenced in the frame have changed**. This is a critical
-piece of functionality in TorchDynamo, because it drives the entire
-invalidation lifecycle. This functionality is called **guards**.
-
-At a very high level, the flow can be summarized like this:
-
-1. TorchDynamo receives a Python frame.
-2. It converts the frame (1) passing it through instruction
-   translation.
-3. For the objects captured in (2), TorchDynamo creates tracking objects that
-   are:
-
-   - tracked on an output graph, which is an internal specialization of a `torch.fx.Tracer`
-   - guards
-
-4. TorchDynamo processes the guard objects created in (3), turning them into a
-   generated Python function, `check_fn`, associated with a piece of code.
-5. The `check_fn` is evaluated whenever we encounter this code a
-   subsequent time - if a `check_fn` passes and evaluates to `True`, TorchDynamo
-   identifies the code in the cache and the code encountered here as same, and
-   can be safely used. If it fails and evaluates to `False`, TorchDynamo
-   identifies the code in the cache as not valid, and can be thrown out in
-   favor of a new entry, through recompilation or a graph break.
-
-Python Frame Evaluation and PEP 523
------------------------------------
-
-The functionality of TorchDynamo is based on
-`PEP 523 <https://peps.python.org/pep-0523/>`__.
-
-TorchDynamo installs a frame evaluation function on Python by using
-`_PyInterpreterState_SetEvalFrameFunc`. TorchDynamo has a hook where
-Python can hand control back to us during evaluation.
-
-The function we have installed is ``convert_frame`` or
-``convert_frame_assert`` in the ``nopython=True`` case, but glossing
-over that nuance for now, let’s take a look at ``convert_frame_assert``,
-as ``convert_frame`` proxies to it.
-
-We can find the function in ``torch/_dynamo/convert_frame.py`` with a signature
-as follows:
-
-.. code-block:: python
-
-   def  convert_frame_assert(compiler_fn: Callable, one_graph=True):
-
-This function wraps the entry point of where Python invokes TorchDynamo
-with a frame:
-
-.. code-block:: python
-
-   def  _convert_frame_assert(frame: types.FrameType, cache_size: int):
-
-Here is what this function does:
-
-1. Checks if it has seen this ``code``\ (see: f_code `here
-   <https://docs.python.org/3/library/inspect.html>`__) before and exits
-   early if it did.
-2. Checks if the code is an unsupported case.
-3. Checks if the ``cache_size`` (second arg above) crosses the limit
-   defined in the config, ``cache_size_limit``. If it has, the function
-   drops the frame and logs warnings. This helps to avoid constant
-   recompilation of a frame as it generally means that the frame is hot
-   in an unexpected way and caching it produces needless overhead,
-   as it is likely to get evicted the next time it is encountered.
-4. Passes the frame, alongside a function that creates an
-   ``InstructionTranslator`` through bytecode
-   transformation, via ``transform_code_object``. A few crucial things
-   happen under the hood here:
-
-   1. New code is produced through ``transform_code_object``.
-
-   2. An FX tracer named ``output`` is produced through
-      ``InstructionTranslator``. This can be a bit confusing,
-      as ``InstructionTranslator`` is not an `fx` tracer, but its stored
-      in a variable named tracer, and its output **is** an `fx` tracer.
-
-   3. The function produces guards and stores them on ``output`` above.
-
-   4. The function produces ``output_instructions`` and stores them on
-      ``output`` above.
-
-   5. The function maps the newly produced transformed code to the initial code it
-      read off the frame. This mapping is worth remembering, we will
-      refer to it much later on below where we cover guard failures.
-
-5. Using the transformed code from 4.1 and the guards from 4.3,
-   the function produces a `GuardedCode`.
-
-Now that we have learned about frame evaluation, let’s review
-``InstructionTranslator``, and see how it turns the frame we handed
-it over into TorchDynamo internal types.
-
-InstructionTranslator
----------------------
-
-`InstructionTranslator` does a lot! We won’t cover the details of
-everything it does, but most importantly for this document, it produces
-a mapping of ``symbolic_locals`` which maintains a mapping from the
-frame’s ``f_locals`` to TorchDynamo internal Variable objects (more on these
-in a moment. ``symbolic_locals`` is filled via traversing the frame’s
-locals:
-
-.. code-block:: python
-
-   self.symbolic_locals = collections.OrderedDict(
-       (k, VariableBuilder(self, LocalSource(k))(f_locals[k]))
-       for k in vars
-       if k in f_locals
-   )
-
-The important component here  is the invocation of a call
-into ``VariableBuilder``. ``VariableBuilder``\ ’s call implementation
-proxies into a function called ``_wrap``, which in turn both constructs
-instances of ``VariableTracker`` and calls ``make_guards`` on them. More
-on that later.
-
-This mapping, in turn, is critical as each Variable has associated
-guards, which are then passed to ``self.output``, the instance of
-``OutputGraph``, an fx tracer, mentioned in 4.2 of the section above. If
-you recall, this ``OutputGraph``, stored in a variable called ``output``
-is where our guards are stored before being passed on to become
-``GuardedCode``
-
-How does ``InstructionTranslator`` do this? At the heart of it, there is
-a loop that is pumped, which drives a function ``step``.
-
-``step`` is just that - a single processing step, taking exactly one
-instruction and doing *something* with it.
-
-.. note:: These are real instructions processed by TorchDynamo’s
-   ``transform_code_object``, and it is pretty cool.
-
-.. note:: This section purposely skips the details of
-   `dis.get_instructions <https://docs.python.org/3/library/dis.html>`__.
-
-For the example above, here is a snippet of a what a few
-``Instruction``\'s may look like:
-
-.. code-block:: python
-
-   Instruction(opcode=124, opname='LOAD_FAST', arg=0, argval='b', offset=32, starts_line=8, is_jump_target=True, target=None)
-   Instruction(opcode=100, opname='LOAD_CONST', arg=3, argval=-1, offset=34, starts_line=None, is_jump_target=False, target=None)
-   Instruction(opcode=20, opname='BINARY_MULTIPLY', arg=None, argval=None, offset=36, starts_line=None, is_jump_target=False, target=None)
-
-This is the core functionality of this function. Take a look at the ``opname``,
-and then take a look at this little snippet from inside ``step``;
-
-.. code-block:: python
-
-   if not hasattr(self, inst.opname):
-       unimplemented(f"missing: {inst.opname}")
-   getattr(self, inst.opname)(inst)
-
-As we can see, the function checks if the current class, the
-``InstructionTranslator`` has an attribute set matching the operator name
-(for example, ``LOAD_CONST``). If it does, the function invokes it, passing the
-whole instruction object in. If it does not, the function drops the frame as
-unimplemented.
-
-For the ``LOAD_CONST`` example, we can see that we do indeed support it,
-with a relatively straightforward definition:
-
-.. code-block:: python
-
-   def LOAD_CONST(self, inst):
-       self.push(ConstantVariable(value=inst.argval))
-
-We can see that this function creates a new instance of the class
-``ConstantVariable`` , with a value, in our example case, -1, and then
-pushes it onto the stack.
-
-There are dozens of such methods - see ``symbolic_convert.py`` for all of
-them. Generally, we implement as many matching methods to Python
-bytecode instructions as possible.
-
-Across both the logic downstream of ``step`` and the logic from invoking
-``VariableBuilder`` - we now have a lot of ``VariableTracker``\ s and of
-course, we’ve spoken about creating guards quiet a bit. Let’s dig into
-what Variables are, and get a little closer to understanding guards.
-
-Variables
----------
-
-A ``ConstantVariable`` is an instance of ``VariableTracker``.
-``VariableTracker`` represents a tracked Python local or stack value.
-
-When it comes to representing an object inside TorchDynamo, a
-``VariableTracker`` does exactly what it says - it tracks a given variable.
-It is an extremely flexible class, but there are a few points to keep in
-mind:
-
--  It manages the ``guard`` relationship around the underlying object
-   through:
-
-   -  ``make_guard``
-   -  ``replace_guards``
-   -  ``add_guard(s)``
-   -  ``propagate`` - ``propagate(*vars: List[List["VariableTracker"]])`` -
-      Perhaps the most important of all, in that it combines guards from
-      all the provided ``VariableTracker`` instances passed in. It visits
-      the guards and combines the guards from these onto itself.
-
--  It acts as a proxy on behalf of the underlying object, implementing
-   methods for the rest of TorchDynamo to get information about the
-   tracked object:
-
-   -  ``call_method``
-   -  ``call_function``
-   -  ``python_type``
-   -  ``as_proxy``
-   -  ``is/as_python_proxy``
-
--  It stores the variable ``source`` of type ``Source``, from
-   ``torchdynamo/source.py``. This source type is a relatively self
-   contained class that helps us organize and bookkeep where the original
-   source came from, and helps provide convenience methods for things
-   like getting the name, and importantly for us, producing guards.
-
-And this class (``VariableTracker``) is built around subclassing,
-somewhere between a full Abstract Base Class and fully fleshed out class
-- it leaves many methods raising ``NotImplementedError`` - with reliance on
-subclasses. See ``torchdynamo/variables/`` for all subclasses to fulfill
-contracts and custom behaviors.
-
-Knowing what we know now, we can see an example of how an instruction
-from ``dis``, ``BUILD_TUPLE``:
-
-   ``BUILD_TUPLE(count)`` Creates a tuple consuming count items from the
-   stack, and pushes the resulting tuple onto the stack.
-
-In our case, our signature will be a *little* different due to the way
-we create ``Instruction`` objects, but the gist of it will be the same.
-Instead of passing in ``count``, we pass in an object with a little
-extra bookkeeping, and of course, we deal with turning regular old
-python objects into TorchDynamo notions:
-
-.. code-block:: python
-
-   def BUILD_TUPLE(self, inst):
-       items = self.popn(inst.argval)
-       options = VariableTracker.propagate(items)
-       self.push(TupleVariable(items, **options))
-
-Here is what this code does:
-
-1. The function reads ``argval``, which in this case, is
-   analogous to ``counts`` in the pydoc for the equivalent instruction.
-
-2. The function ``popn`` the items, in this case, the signature is
-   ``def  popn(self, n: int) -> List[TensorVariable]:`` this hints at an
-   underlying contract - we are returning ``TensorVariables``. If we
-   take a closer look at ``symbolic_convert.py`` and
-   ``InstructionTranslatorBase``/``InstructionTranslator``\ we see that
-   the only thing pushed onto and popped from our stack are
-   ``VariableTracker``\ s.
-
-3) The function calls ``VariableTracker.propagate``. This
-   takes the guards from every single item popped off the stack in 2,
-   and recursively traverses it and combines all the guards into
-   ``options``: ``py  return {      "guards": guards,  }``
-
-4) The function then makes a new instance of a ``VariableTracker``,
-   ``TupleVariable``\ out of the ``items`` and ``options``. This then
-   allows us to install all the appropriate guards from the ``items``
-   that make up the new ``TupleVariable``
-
-.. note:: Where did the first guards come from? Propagation
-   is a good technique, but we need something created before it can be
-   propagated. ``VariableBuilder`` calls
-   ``make_guards`` as it creates ``VariableTracker`` instances, from
-   ``f_locals``. This in turn calls into the ``source``, to have it create
-   guards.
-
-After all this, bytecode translation is done and we are one step closer
-to producing ``GuardedCode``. We now understand how locals become
-``VariableTracker``\ s, how instructions are handled, and where guards
-are called on for creation. Before we can go into seeing how code and
-guards are combined into a GuardedCode object, we need to dig a little
-bit into those ``make_guard`` and ``source.make_guard`` calls above. We
-can then understand, what was going on when we made guards
-alongside, and on, ``VariableTracker`` instances.
-
-Making Guards
--------------
-
-Guards are just Python objects, of the class ``Guard``. Let's look at them
-in more detail.
-
-Looking at the definition of the dataclass (and therefore, ctor
-signature), we see that it has a name, a source, and a create function.
-
-.. code-block:: python
-
-   @dataclasses.dataclass
-   class Guard:
-       name: str
-       source: GuardSource
-       create_fn: Callable
-
-The name should be the name of the variable.
-
-The source here is an enum indicating what *kind* of source the guard
-belongs to.
-
-.. note:: Not to be confused with ``Source`` and the other types
-   in ``source.py``, as stored on ``VariableTracker``.
-
-``create_fn`` provides the main functionality to transition from a simple
-dataclass to actually producing valid Python code to be invoked for
-knowing whether or not things have changed in between invocations, and
-whether we can safely read from the code cache or not.
-
-The most common code paths for getting an instance of a guard are
-through ``make_guards`` on ``VariableTracker``.
-``make_guards`` -> ``source.make_guard`` -> ``return Guard(self.name(), self.guard_source(), fn)``
-
-Or, in a concrete example:
-
-.. code-block:: python
-
-   ...
-   elif istype(value, range):
-       guards = self.make_guards(GuardBuilder.EQUALS_MATCH)
-       return RangeVariable(value=value, guards=guards)
-
-Since ``source`` was set at the construction time of this
-``VariableTracker``, all that was needed here was to provide the ``fn``,
-``GuardBuilder.EQUALS_MATCH`` to the ``create_fn`` field.
-
-This ``create_fn`` must be a method on ``GuardBuilder``. The reason for
-this becomes apparent in our next step. Once we have all the guards
-created for a frame, we move on to ``CheckFunctionManager`` and
-``compile_check_fn``.
-
-Before the ``convert_frame`` function can produce a ``GuardedCode``,
-it needs to run the ``CheckFunctionManager``, with all the guards, to
-produce a ``check_fn`` which will then, in turn get passed in alongside
-the code into ``GuardedCode``. This is the same ``check_fn`` that we store in our
-cache entry, and the same one we run to know whether or not to retrieve
-the code stored alongside. For reference, here is that code:
-
-.. code-block:: cpp
-
-   static CacheEntry *create_cache_entry(CacheEntry *next,
-                                         PyObject *guarded_code) {
-     CacheEntry *e = (CacheEntry *)malloc(sizeof(CacheEntry));
-     DEBUG_NULL_CHECK(e);
-     e->check_fn = PyObject_GetAttrString(guarded_code, "check_fn");
-     NULL_CHECK(e->check_fn);
-     e->code = (PyCodeObject *)PyObject_GetAttrString(guarded_code, "code");
-     NULL_CHECK(e->code);
-     e->next = next;
-     return e;
-   }
-
-We now know how a ``check_fn`` function is used, and who makes it, and
-what it is composed of, but what we do not yet know is how. How does a
-list of ``Guard`` objects become a function we can run later on?
-
-First, we iterate these guards:
-
-.. code-block:: python
-
-   for guard in sorted(guards or [], key=Guard.sort_key):
-       if not config.guard_nn_modules and guard.is_nn_module():
-           continue
-       guard.create(local_builder, global_builder)
-
-Calling ``guard.create`` runs that ``create_fn`` we set on the ``Guard``
-class above (don’t confuse it with the ``check_fn`` we are working on
-producing, the names are similar, so it can get a little confusing). In
-our example above, our ``create_fn`` is ``GuardBuilder.EQUALS_MATCH``.
-So we are now invoking it, passing in the ``self``, the guard itself,
-in.
-
-The signature is: ``def EQUALS_MATCH(self, guard: Guard):``
-
-And internally to that function, we can use the ``name`` on the guard to
-get back our original object, querying it for data and type information,
-which in turn gets us to the most important bit: appending code.
-
-At its simplest, ``EQUALS_MATCH`` appends just one line of code:
-``self.code.append(f"{ref} == {val!r}")``. Where ``ref`` is the name of
-the variable, and ``val`` is the value. It might produce code like this:
-
-.. code-block:: python
-
-   y == 2
-
-This is a basic example. But if we append a few other kinds of ``GuardBuilder``
-functions and then combine them all with
-``and`` in between each statement (as we do), we might get something
-like this:
-
-.. code-block:: python
-
-   ___guarded_code.valid and ___check_type_id(y, 94367738391392) and y == 2 and ___check_tensors(x)
-
-Here is what this code performs:
-
-1. A check for ``.valid``
-2. A type ID check
-3. A value check
-4. A tensor check
-
-This becomes the heart of the code our ``check_fn``, which in turn
-is evaluated the **next** time we encounter this code. It
-will then check:
-
-1. Is this code still valid?
-2. If (1), Does ``y`` still have a type of ``94367738391392``?
-3. If (2), is ``y`` still 2?
-4. If (3), let’s check on if tensor ``x`` changed in some specific ways.
-
-If all of these are still true, then we can use the code cached
-alongside this ``check_fn``.
-
-.. note:: For a deeper dive for how and where this happens
-   you can read ``static PyCodeObject *lookup(CacheEntry *e, PyObject *f_locals) {`` of
-   ``_eval_frame.c``.
-
-If not, then, we can move on to recompiling the code anew, and storing
-that in the cache alongside this code, and a whole new ``check_fn``,
-again to be checked on yet another subsequent frame.
-
-There are lots of other such functions on ``GuardBuilder`` which get
-coalesced into, at times massive, strings which then get evaluated as
-Python code and stored into ``check_fn``. The example above
-illustrates of a simple case. To understand this functionality better, read
-the other functions on ``GuardBuilder``, or better yet, dump the ``code`` variable
-in ``compile_check_fn`` to see what is getting produced,
-especially on larger, real models.
-
-Summary
--------
-
-In this section, we have reviewed:
-
-- The role of ``.valid`` and invalidation around weak references (and potentially soon to be NN Moduleinvalidations).
-- How the C++ side of guard functions (``___check_type_id``, ``___check_tensors``, etc) operate.
-- What happens when guards fail.
-- What happens if we produce invalid guard code.
-
-We covered how user provided code wrapped in a TorchDynamo context
-goes on to get traced and tracked internally, organized into ``VariableTracker``\ s
-``Source``\ s and subsequently ``Guard``\ s, and how those ``Guards`` in
-turn guide cache entry selection and invalidation when handing Python
-code.
diff --git a/docs/source/torch.compiler_troubleshooting.rst b/docs/source/torch.compiler_troubleshooting.rst
index eec046b..f98a4dc 100644
--- a/docs/source/torch.compiler_troubleshooting.rst
+++ b/docs/source/torch.compiler_troubleshooting.rst
@@ -589,8 +589,8 @@
 Some graph break reasons are insurmountable to TorchDynamo, and can't be
 easily fixed. - calling into a C extension other than torch is invisible
 to torchdynamo, and could do arbitrary things without TorchDynamo being
-able to introduce necessary `guards <./torch.compiler_guards_overview.rst>`__ to
-ensure that the compiled program would be safe to reuse. Graph breaks
+able to introduce necessary guards (see :ref:`making-dynamo-sound-guards`)
+to ensure that the compiled program would be safe to reuse. Graph breaks
 can hinder performance if the resulting fragments are small. To maximize
 performance, it's important to have as few graph breaks as possible.