|  | ================== | 
|  | Vectorization Plan | 
|  | ================== | 
|  |  | 
|  | .. contents:: | 
|  | :local: | 
|  |  | 
|  | Abstract | 
|  | ======== | 
|  | The vectorization transformation can be rather complicated, involving several | 
|  | potential alternatives, especially for outer-loops [1]_ but also possibly for | 
|  | innermost loops. These alternatives may have significant performance impact, | 
|  | both positive and negative. A cost model is therefore employed to identify the | 
|  | best alternative, including the alternative of avoiding any transformation | 
|  | altogether. | 
|  |  | 
|  | The Vectorization Plan is an explicit model for describing vectorization | 
|  | candidates. It serves for both optimizing candidates including estimating their | 
|  | cost reliably, and for performing their final translation into IR. This | 
|  | facilitates dealing with multiple vectorization candidates. | 
|  |  | 
|  | Current Status | 
|  | ============== | 
|  | VPlan is currently used to drive code-generation in LoopVectorize. VPlans are | 
|  | constructed after all cost-based and most legality-related decisions have been | 
|  | taken. As def-use chains between recipes are now fully modeled in VPlan, | 
|  | VPlan-based analyses and transformations are used to simplify and modularize | 
|  | the vectorization process [10]_. Those include transformations to | 
|  |  | 
|  | 1. Legalize the initial VPlan, e.g. by introducing specialized recipes for | 
|  | reductions and interleave groups. | 
|  |  | 
|  | 2. Optimize the legalized VPlan, e.g. by removing redundant recipes or | 
|  | introducing active-lane-masks. | 
|  |  | 
|  | 3. Apply unroll- and vectorization-factor specific optimizations, e.g. removing | 
|  | the backedge to reiterate the vector loop based on VF & UF. | 
|  |  | 
|  | Refer to :numref:`fig-vplan-transform-pipeline` for an overview of the current | 
|  | transformation pipeline. | 
|  |  | 
|  | Note that some legality checks are already done in VPlan, including checking if | 
|  | all users of a fixed-order recurrence can be re-ordered. This is implemented as | 
|  | a VPlan-to-VPlan transformation that either applies a valid re-ordering or | 
|  | bails out marking the VPlan as invalid. | 
|  |  | 
|  | .. _fig-vplan-transform-pipeline: | 
|  | .. figure:: ./vplan-transform-pipeline.png | 
|  | :width: 800 px | 
|  |  | 
|  | VPlan Transformation Pipeline in 2024 | 
|  |  | 
|  |  | 
|  | VPlan currently models the complete vector loop, as well as additional parts of | 
|  | the vectorization skeleton. Refer to :numref:`fig-vplan-scope` for an overview | 
|  | of the scope covered by VPlan. | 
|  |  | 
|  | .. _fig-vplan-scope: | 
|  | .. figure:: ./vplan-scope.png | 
|  | :width: 800 px | 
|  |  | 
|  | Scope modeled in VPlan in 2024 | 
|  |  | 
|  |  | 
|  | High-level Design | 
|  | ================= | 
|  |  | 
|  | Vectorization Workflow | 
|  | ---------------------- | 
|  | VPlan-based vectorization involves three major steps, taking a "scenario-based | 
|  | approach" to vectorization planning: | 
|  |  | 
|  | 1. Legal Step: check if a loop can be legally vectorized; encode constraints and | 
|  | artifacts if so. | 
|  | 2. Plan Step: | 
|  |  | 
|  | a. Build initial VPlans following the constraints and decisions taken by | 
|  | Legal Step 1, and compute their cost. | 
|  | b. Apply optimizations to the VPlans, possibly forking additional VPlans. | 
|  | Prune sub-optimal VPlans having relatively high cost. | 
|  | 3. Execute Step: materialize the best VPlan. Note that this is the only step | 
|  | that modifies the IR. | 
|  |  | 
|  | Design Guidelines | 
|  | ----------------- | 
|  | In what follows, the term "input IR" refers to code that is fed into the | 
|  | vectorizer whereas the term "output IR" refers to code that is generated by the | 
|  | vectorizer. The output IR contains code that has been vectorized or "widened" | 
|  | according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed | 
|  | according to an Unroll Factor (UF). | 
|  | The design of VPlan follows several high-level guidelines: | 
|  |  | 
|  | 1. Analysis-like: building and manipulating VPlans must not modify the input IR. | 
|  | In particular, if the best option is not to vectorize at all, the | 
|  | vectorization process terminates before reaching Step 3, and compilation | 
|  | should proceed as if VPlans had not been built. | 
|  |  | 
|  | 2. Align Cost & Execute: each VPlan must support both estimating the cost and | 
|  | generating the output IR code, such that the cost estimation evaluates the | 
|  | to-be-generated code reliably. | 
|  |  | 
|  | 3. Support vectorizing additional constructs: | 
|  |  | 
|  | a. Outer-loop vectorization. In particular, VPlan must be able to model the | 
|  | control-flow of the output IR which may include multiple basic-blocks and | 
|  | nested loops. | 
|  | b. SLP vectorization. | 
|  | c. Combinations of the above, including nested vectorization: vectorizing | 
|  | both an inner loop and an outer-loop at the same time (each with its own | 
|  | VF and UF), mixed vectorization: vectorizing a loop with SLP patterns | 
|  | inside [4]_, (re)vectorizing input IR containing vector code. | 
|  | d. Function vectorization [2]_. | 
|  |  | 
|  | 4. Support multiple candidates efficiently. In particular, similar candidates | 
|  | related to a range of possible VF's and UF's must be represented efficiently. | 
|  | Potential versioning needs to be supported efficiently. | 
|  |  | 
|  | 5. Support vectorizing idioms, such as interleaved groups of strided loads or | 
|  | stores. This is achieved by modeling a sequence of output instructions using | 
|  | a "Recipe", which is responsible for computing its cost and generating its | 
|  | code. | 
|  |  | 
|  | 6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization | 
|  | such regions may need to be, for example, predicated and linearized, or | 
|  | replicated VF*UF times to handle scalarized and predicated instructions. | 
|  | Innerloops are also modelled as SESE regions. | 
|  |  | 
|  | 7. Support instruction-level analysis and transformation, as part of Planning | 
|  | Step 2.b: During vectorization instructions may need to be traversed, moved, | 
|  | replaced by other instructions or be created. For example, vector idiom | 
|  | detection and formation involves searching for and optimizing instruction | 
|  | patterns. | 
|  |  | 
|  | Definitions | 
|  | =========== | 
|  | The low-level design of VPlan comprises of the following classes. | 
|  |  | 
|  | :LoopVectorizationPlanner: | 
|  | A LoopVectorizationPlanner is designed to handle the vectorization of a loop | 
|  | or a loop nest. It can construct, optimize and discard one or more VPlans, | 
|  | each VPlan modelling a distinct way to vectorize the loop or the loop nest. | 
|  | Once the best VPlan is determined, including the best VF and UF, this VPlan | 
|  | drives the generation of output IR. | 
|  |  | 
|  | :VPlan: | 
|  | A model of a vectorized candidate for a given input IR loop or loop nest. This | 
|  | candidate is represented using a Hierarchical CFG. VPlan supports estimating | 
|  | the cost and driving the generation of the output IR code it represents. | 
|  |  | 
|  | :Hierarchical CFG: | 
|  | A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The | 
|  | Hierarchical CFG data structure is similar to the Tile Tree [5]_, where | 
|  | cross-Tile edges are lifted to connect Tiles instead of the original | 
|  | basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms | 
|  | Region and Block are used rather than Tile [5]_ to avoid confusion with loop | 
|  | tiling. | 
|  |  | 
|  | :VPBlockBase: | 
|  | The building block of the Hierarchical CFG. A pure-virtual base-class of | 
|  | VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical | 
|  | control-flow relations with other VPBlocks. Note that in contrast to the IR | 
|  | BasicBlock, a VPBlockBase models its control-flow successors and predecessors | 
|  | directly, rather than through a Terminator branch or through predecessor | 
|  | branches that "use" the VPBlockBase. | 
|  |  | 
|  | :VPBasicBlock: | 
|  | VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the | 
|  | Hierarchical CFG. It represents a sequence of output IR instructions that will | 
|  | appear consecutively in an output IR basic-block. The instructions of this | 
|  | basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a | 
|  | sequence of zero or more VPRecipes that model the cost and generation of the | 
|  | output IR instructions. | 
|  |  | 
|  | :VPRegionBlock: | 
|  | VPRegionBlock is a subclass of VPBlockBase. It models a collection of | 
|  | VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR | 
|  | CFG. A VPRegionBlock may indicate that its contents are to be replicated a | 
|  | constant number of times when output IR is generated, effectively representing | 
|  | a loop with constant trip-count that will be completely unrolled. This is used | 
|  | to support scalarized and predicated instructions with a single model for | 
|  | multiple candidate VF's and UF's. | 
|  |  | 
|  | :VPRecipeBase: | 
|  | A pure-virtual base class modeling a sequence of one or more output IR | 
|  | instructions, possibly based on one or more input IR instructions. These | 
|  | input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe | 
|  | may specify how its ingredients are to be transformed to produce the output IR | 
|  | instructions; e.g., cloned once, replicated multiple times or widened | 
|  | according to selected VF. | 
|  |  | 
|  | :VPValue: | 
|  | The base of VPlan's def-use relations class hierarchy. When instantiated, it | 
|  | models a constant or a live-in Value in VPlan. It has users, which are of type | 
|  | VPUser, but no operands. | 
|  |  | 
|  | :VPUser: | 
|  | A VPUser represents an entity that uses a number of VPValues as operands. | 
|  | VPUser is similar in some aspects to LLVM's User class. | 
|  |  | 
|  | :VPDef: | 
|  | A VPDef represents an entity that defines zero, one or multiple VPValues. | 
|  | It is used to model the fact that recipes in VPlan can define multiple | 
|  | VPValues. | 
|  |  | 
|  | :VPInstruction: | 
|  | A VPInstruction is a recipe characterized by a single opcode and optional | 
|  | flags, free of ingredients or other meta-data. VPInstructions also extend | 
|  | LLVM IR's opcodes with idiomatic operations that enrich the Vectorizer's | 
|  | semantics. | 
|  |  | 
|  | :VPTransformState: | 
|  | Stores information used for generating output IR, passed from | 
|  | LoopVectorizationPlanner to its selected VPlan for execution, and used to pass | 
|  | additional information down to VPBlocks and VPRecipes. | 
|  |  | 
|  | The Planning Process and VPlan Roadmap | 
|  | ====================================== | 
|  |  | 
|  | Transforming the Loop Vectorizer to use VPlan follows a staged approach. First, | 
|  | VPlan was only used to record the final vectorization decisions, and to execute | 
|  | them: the Hierarchical CFG models the planned control-flow, and Recipes capture | 
|  | decisions taken inside basic-blocks. Currently, VPlan is used also as the basis | 
|  | for taking these decisions, effectively turning them into a series of | 
|  | VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process | 
|  | itself including cost-based analyses for making these decisions, to fully | 
|  | support compositional and iterative decision making. | 
|  |  | 
|  | Some decisions are local to an instruction in the loop, such as whether to widen | 
|  | it into a vector instruction or replicate it, keeping the generated instructions | 
|  | in place. Other decisions, however, involve moving instructions, replacing them | 
|  | with other instructions, and/or introducing new instructions. For example, a | 
|  | cast may sink past a later instruction and be widened to handle first-order | 
|  | recurrence; an interleave group of strided gathers or scatters may effectively | 
|  | move to one place where they are replaced with shuffles and a common wide vector | 
|  | load or store; new instructions may be introduced to compute masks, shuffle the | 
|  | elements of vectors, and pack scalar values into vectors or vice-versa. | 
|  |  | 
|  | In order for VPlan to support making instruction-level decisions and analyses, | 
|  | it needs to model the relevant instructions along with their def/use relations. | 
|  | This too follows a staged approach: first, the new instructions that compute | 
|  | masks are modeled as VPInstructions, along with their induced def/use subgraph. | 
|  | This effectively models masks in VPlan, facilitating VPlan-based predication. | 
|  | Next, the logic embedded within each Recipe for generating its instructions at | 
|  | VPlan execution time, will instead take part in the planning process by modeling | 
|  | them as VPInstructions. Finally, only logic that applies to instructions as a | 
|  | group will remain in Recipes, such as interleave groups and potentially other | 
|  | idiom groups having synergistic cost. | 
|  |  | 
|  | Related LLVM components | 
|  | ----------------------- | 
|  | 1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP | 
|  | tree, where TSLP [3]_ adds Plan Step 2.b. | 
|  |  | 
|  | 2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by | 
|  | Polly [7]_. | 
|  |  | 
|  | 3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of | 
|  | the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_. | 
|  |  | 
|  | References | 
|  | ---------- | 
|  | .. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit | 
|  | Nuzman and Ayal Zaks, PACT 2008. | 
|  |  | 
|  | .. [2] "Proposal for function vectorization and loop vectorization with function | 
|  | calls", Xinmin Tian, [`cfe-dev | 
|  | <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_]., | 
|  | March 2, 2016. | 
|  | See also `review <https://reviews.llvm.org/D22792>`_. | 
|  |  | 
|  | .. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios | 
|  | Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015. | 
|  |  | 
|  | .. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization | 
|  | overhead", Hao Zhou and Jingling Xue, CGO 2016. | 
|  |  | 
|  | .. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and | 
|  | Brian Koblenz, PLDI 1991 | 
|  |  | 
|  | .. [6] "Structural analysis: A new approach to flow analysis in optimizing | 
|  | compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 | 
|  |  | 
|  | .. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma | 
|  | thesis, 2011. | 
|  |  | 
|  | .. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks, | 
|  | European LLVM Developers' Meeting 2017. | 
|  |  | 
|  | .. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop | 
|  | Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016. | 
|  |  | 
|  | .. [10] "VPlan: Status Update and Future Roadmap", Ayal Zaks and Florian Hahn, | 
|  | LLVM Developers' Meeting 2023, https://www.youtube.com/watch?v=SzGP4PgMuLE |