[AMDGPU][NFC] Add DWARF extension support for SIMD execution

- Add current iteration to the context of a DWARF expression evaluation.
- Add DW_AT_LLVM_iterations attribute to specify the number of iterations
  executing concurrently.
- Add DW_OP_LLVM_push_iteration to support optimizations that result in
  multiple iterations executing concurrently.
- Add DW_OP_LLVM_overlay and DW_OP_LLVM_bit_overlay to support expressing the
  location of arrays that are promoted to vector registers in SIMD vectorized
  loops.
- Generally clarify the difference between SIMT and SIMD execution.
- Change the DW_AT_LLVM_active_lane attribute to take a location description
  expression so that a loclist can be used to express different values at
  different program locations.

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D117572
2022-01-08 18:48:14 +00:00 · 8ba5043dbf
parent d083a0e589
commit 8ba5043dbf
1 changed files with 277 additions and 59 deletions
--- a/llvm/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.rst
+++ b/llvm/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.rst
@@ -462,9 +462,11 @@ If the source language is mapped onto the AMDGPU wavefronts in a SIMT manner,
 then the variable DWARF location expressions must compute the location for a
 single lane of the wavefront. Therefore, a DWARF operation is required to denote
 the current lane, much like ``DW_OP_push_object_address`` denotes the current
-object.
+object. See ``DW_OP_LLVM_push_lane`` in :ref:`amdgpu-dwarf-literal-operations`.

-See ``DW_OP_LLVM_push_lane`` in :ref:`amdgpu-dwarf-base-type-entries`.
+In addition, a way is needed for the compiler to communicate how many source
+language threads of execution are mapped to a target architecture thread's SIMT
+lanes. See ``DW_AT_LLVM_lanes`` in :ref:`amdgpu-dwarf-low-level-information`.

 .. _amdgpu-dwarf-support-for-divergent-control-flow-of-simt-hardware:

@@ -555,6 +557,76 @@ by the AMDGPU, is added.

 See :ref:`amdgpu-dwarf-language-names-table`.

+2.19 Support for Source Language Optimizations that Result in Concurrent Iteration Execution
+--------------------------------------------------------------------------------------------
+
+A compiler can perform loop optimizations that result in the generated code
+executing multiple iterations concurrently. For example, software pipelining
+schedules multiple iterations in an interleaved fashion to allow the
+instructions of one iteration to hide the latencies of the instructions of
+another iteration. Another example is vectorization that can exploit SIMD
+hardware to allow a single instruction to execute multiple iterations using
+vector registers.
+
+Note that although this is similar to SIMT execution, the way a client debugger
+uses the information is fundamentally different. In SIMT execution the debugger
+needs to present the concurrent execution as distinct source language threads
+that the user can list and switch focus between. With iteration concurrency
+optimizations, such as software pipelining and vectorized SIMD, the debugger
+must not present the concurrency as distinct source language threads. Instead,
+it must inform the user that multiple loop iterations are executing in parallel
+and allow the user to select between them.
+
+In general, SIMT execution fixes the number of concurrent executions per target
+architecture thread. However, both software pipelining and SIMD vectorization
+may vary the number of concurrent iterations for different loops executed by a
+single source language thread.
+
+It is possible for the compiler to use both SIMT concurrency and iteration
+concurrency techniques in the code of a single source language thread.
+
+Therefore, a DWARF operation is required to denote the current concurrent
+iteration instance, much like ``DW_OP_push_object_address`` denotes the current
+object. See ``DW_OP_LLVM_push_iteration`` in
+:ref:`amdgpu-dwarf-literal-operations`.
+
+In addition, a way is needed for the compiler to communicate how many source
+language loop iterations are executing concurrently. See
+``DW_AT_LLVM_iterations`` in :ref:`amdgpu-dwarf-low-level-information`.
+
+2.20 DWARF Operation to Create Runtime Overlay Composite Location Description
+-----------------------------------------------------------------------------
+
+It is common in SIMD vectorization for the compiler to generate code that
+promotes portions of an array into vector registers. For example, if the
+hardware has vector registers with 8 elements, and 8 wide SIMD instructions, the
+compiler may vectorize a loop so that it executes 8 iterations concurrently for
+each vectorized loop iteration.
+
+On the first iteration of the generated vectorized loop, iterations 0 to 7 of
+the source language loop will be executed using SIMD instructions. Then on the
+next iteration of the generated vectorized loop, iterations 8 to 15 will be
+executed, and so on.
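
The mapping described above can be sketched in Python. This is only an illustration, assuming the vectorization width of 8 used in the example; the function and parameter names are hypothetical and are not defined by the DWARF extension.

```python
# Hypothetical helper for illustration only; not part of DWARF or LLVM.
def source_iteration(vectorized_iteration, current_iteration, width=8):
    """Map a vectorized loop iteration and a concurrent iteration instance
    (0 <= current_iteration < width) to the source language loop iteration."""
    if not 0 <= current_iteration < width:
        raise ValueError("current iteration is outside the concurrent range")
    return vectorized_iteration * width + current_iteration


# Second vectorized iteration (index 1) covers source iterations 8 to 15.
print([source_iteration(1, i) for i in range(8)])
```

A debugger could use such a mapping to label each concurrently executing iteration with its source language loop index.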
+
+If the source language loop accesses an array element based on the loop
+iteration index, the compiler may read the element into a register for the
+duration of that iteration. On the next iteration it will read the next element
+into the register, and so on. With SIMD, this generalizes to the compiler
+reading array elements 0 to 7 into a vector register on the first vectorized
+loop iteration, then array elements 8 to 15 on the next iteration, and so on.
+
+The DWARF location description for the array needs to express that all elements
+are in memory, except the slice that has been promoted to the vector register.
+The starting position of the slice is a runtime value based on the iteration
+index modulo the vectorization size. This cannot be expressed by ``DW_OP_piece``
+and ``DW_OP_bit_piece`` which only allow constant offsets to be expressed.
+
+Therefore, a new operator is defined that takes two location descriptions, an
+offset and a size, and creates a composite that effectively uses the second
+location description as an overlay of the first, positioned according to the
+offset and size. See ``DW_OP_LLVM_overlay`` and ``DW_OP_LLVM_bit_overlay`` in
+:ref:`amdgpu-dwarf-composite-location-description-operations`.
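
As a rough illustration of what the overlay expresses, the following Python sketch models how a consumer might read one array element when a slice of the array has been promoted to a vector register. All names are hypothetical, and ``slice_start`` is assumed to be supplied as a runtime value; this is a conceptual model, not the operation's definition.

```python
# Conceptual model only: the names below are assumptions for illustration.
def read_element(index, memory, vector_register, slice_start):
    """Read array element `index`, taking it from the vector register when it
    falls inside the promoted slice and from memory otherwise."""
    width = len(vector_register)
    if slice_start <= index < slice_start + width:
        return vector_register[index - slice_start]
    return memory[index]


memory = list(range(16))                        # stale or unpromoted values
vector_register = [100 + i for i in range(8)]   # live values for elements 8..15
print(read_element(3, memory, vector_register, slice_start=8))   # from memory
print(read_element(9, memory, vector_register, slice_start=8))   # from register
```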
+
 .. _amdgpu-dwarf-changes-relative-to-dwarf-version-5:

 A. Changes Relative to DWARF Version 5
@@ -599,10 +671,11 @@ The following table provides the additional attributes.
   =========================== ====================================
   Attribute                   Usage
   =========================== ====================================
-   ``DW_AT_LLVM_active_lane``  SIMD or SIMT active lanes (see :ref:`amdgpu-dwarf-low-level-information`)
+   ``DW_AT_LLVM_active_lane``  SIMT active lanes (see :ref:`amdgpu-dwarf-low-level-information`)
   ``DW_AT_LLVM_augmentation`` Compilation unit augmentation string (see :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`)
-   ``DW_AT_LLVM_lane_pc``      SIMD or SIMT lane program location (see :ref:`amdgpu-dwarf-low-level-information`)
-   ``DW_AT_LLVM_lanes``        SIMD or SIMT thread lane count (see :ref:`amdgpu-dwarf-low-level-information`)
+   ``DW_AT_LLVM_lane_pc``      SIMT lane program location (see :ref:`amdgpu-dwarf-low-level-information`)
+   ``DW_AT_LLVM_lanes``        SIMT lane count (see :ref:`amdgpu-dwarf-low-level-information`)
+   ``DW_AT_LLVM_iterations``   Concurrent iteration count (see :ref:`amdgpu-dwarf-low-level-information`)
   ``DW_AT_LLVM_vector_size``  Base type vector size (see :ref:`amdgpu-dwarf-base-type-entries`)
   =========================== ====================================

@@ -679,33 +752,59 @@ elements that can be specified are:

 *A current thread*

-  The target architecture thread identifier of the source program thread of
-  execution for which a user presented expression is currently being evaluated.
+  The target architecture thread identifier. For source languages that are not
+  implemented using a SIMT execution model, this corresponds to the source
+  program thread of execution for which a user presented expression is currently
+  being evaluated. For source languages that are implemented using a SIMT
+  execution model, this together with the current lane corresponds to the source
+  program thread of execution for which a user presented expression is currently
+  being evaluated.

  It is required for operations that are related to target architecture threads.

  *For example, the* ``DW_OP_regval_type`` *operation, or the*
  ``DW_OP_form_tls_address`` *and* ``DW_OP_LLVM_form_aspace_address``
-  *operations when given an address space that is thread specific.*
+  *operations when given an address space that is target architecture thread
+  specific.*

 *A current lane*

-  The target architecture lane identifier of the source program thread of
-  execution for which a user presented expression is currently being evaluated.
-  This applies to languages that are implemented using a SIMD or SIMT execution
-  model.
+  The 0 based SIMT lane identifier to be used in evaluating a user presented
+  expression. This applies to source languages that are implemented for a target
+  architecture using a SIMT execution model. These implementations map source
+  language threads of execution to lanes of the target architecture threads.

-  It is required for operations that are related to target architecture lanes.
+  It is required for operations that are related to SIMT lanes.

  *For example, the* ``DW_OP_LLVM_push_lane`` *operation and*
  ``DW_OP_LLVM_form_aspace_address`` *operation when given an address space that
-  is lane specific.*
+  is SIMT lane specific.*

-  If specified, it must be consistent with any specified current thread and
-  current target architecture. It is consistent with a thread if it identifies a
-  lane of the thread. It is consistent with a target architecture if it is a
-  valid lane identifier of the target architecture. Otherwise the result is
-  undefined.
+  If specified, it must be consistent with the value of the ``DW_AT_LLVM_lanes``
+  attribute of the subprogram corresponding to context's frame and program
+  location. It is consistent if the value is greater than or equal to 0 and less
+  than the, possibly default, value of the ``DW_AT_LLVM_lanes`` attribute.
+  Otherwise the result is undefined.
+
+*A current iteration*
+
+  The 0 based source language iteration instance to be used in evaluating a user
+  presented expression. This applies to target architectures that support
+  optimizations that result in executing multiple source language loop iterations
+  concurrently.
+
+  *For example, software pipelining and SIMD vectorization.*
+
+  It is required for operations that are related to source language loop
+  iterations.
+
+  *For example, the* ``DW_OP_LLVM_push_iteration`` *operation.*
+
+  If specified, it must be consistent with the value of the
+  ``DW_AT_LLVM_iterations`` attribute of the subprogram corresponding to
+  context's frame and program location. It is consistent if the value is greater
+  than or equal to 0 and less than the, possibly default, value of the
+  ``DW_AT_LLVM_iterations`` attribute. Otherwise the result is undefined.

 *A current call frame*

@@ -1520,12 +1619,30 @@ size and the low-order bits used.

 8.  ``DW_OP_LLVM_push_lane`` *New*

-    ``DW_OP_LLVM_push_lane`` pushes the target architecture lane identifier of
-    the current lane as a value with the generic type.
+    ``DW_OP_LLVM_push_lane`` pushes the current lane as a value with the generic
+    type.

-    *For languages that are implemented using a SIMD or SIMT execution model,
-    this is the lane number that corresponds to the source language thread of
-    execution upon which the user is focused.*
+    *For source languages that are implemented using a SIMT execution model,
+    this is the zero-based lane number that corresponds to the source language
+    thread of execution upon which the user is focused.*
+
+    The value must be greater than or equal to 0 and less than the value of the
+    ``DW_AT_LLVM_lanes`` attribute, otherwise the DWARF expression is
+    ill-formed. See :ref:`amdgpu-dwarf-low-level-information`.
+
+9.  ``DW_OP_LLVM_push_iteration`` *New*
+
+    ``DW_OP_LLVM_push_iteration`` pushes the current iteration as a value with
+    the generic type.
+
+    *For source language implementations with optimizations that cause multiple
+    loop iterations to execute concurrently, this is the zero-based iteration
+    number that corresponds to the source language concurrent loop iteration
+    upon which the user is focused.*
+
+    The value must be greater than or equal to 0 and less than the value of the
+    ``DW_AT_LLVM_iterations`` attribute, otherwise the DWARF expression is
+    ill-formed. See :ref:`amdgpu-dwarf-low-level-information`.
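
The range requirement shared by ``DW_OP_LLVM_push_lane`` and ``DW_OP_LLVM_push_iteration`` can be sketched as follows. This is a minimal Python model, assuming the evaluation stack is a plain list and that the attribute value (``DW_AT_LLVM_lanes`` or ``DW_AT_LLVM_iterations``) has already been resolved to an integer; the function name is hypothetical.

```python
# Hypothetical evaluator fragment; the stack representation is an assumption.
def push_context_index(stack, current, count, what):
    """Push the current lane or iteration, rejecting out-of-range values.

    `count` stands for the resolved DW_AT_LLVM_lanes or DW_AT_LLVM_iterations
    value (default 1 when the attribute is absent).
    """
    if not 0 <= current < count:
        raise ValueError(f"ill-formed: {what} {current} is not in [0, {count})")
    stack.append(current)
    return stack


print(push_context_index([], 3, 8, "iteration"))
```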

 .. _amdgpu-dwarf-arithmetic-logical-operations:

@@ -2090,11 +2207,11 @@ type.

    *For example, if AS is for per thread storage then LS is the location
    storage for the current thread. For languages that are implemented using a
-    SIMD or SIMT execution model, then if AS is for per lane storage then LS is
-    the location storage for the current lane of the current thread. Therefore,
-    if L is accessed by an operation, the location storage selected when the
-    location description was created is accessed, and not the location storage
-    associated with the current context of the access operation.*
+    SIMT execution model, then if AS is for per lane storage then LS is the
+    location storage for the current lane of the current thread. Therefore, if L
+    is accessed by an operation, the location storage selected when the location
+    description was created is accessed, and not the location storage associated
+    with the current context of the access operation.*

    The DWARF expression is ill-formed if AS is not one of the values defined by
    the target architecture specific ``DW_ASPACE_*`` values.
@@ -2699,6 +2816,57 @@ compatible with the definitions in DWARF Version 5.*
    The DWARF expression is ill-formed if S or C are 0, or if the bit size of M
    is less than C.

+6.  ``DW_OP_LLVM_overlay`` *New*
+
+    ``DW_OP_LLVM_overlay`` pops four stack entries. The first must be an
+    integral type value that represents the overlay byte size value S. The
+    second must be an integral type value that represents the overlay byte
+    offset value O. The third must be a location description that represents the
+    overlay location description OL. The fourth must be a location description
+    that represents the base location description BL.
+
+    The action is the same as for ``DW_OP_LLVM_bit_overlay``, except that the
+    overlay bit size BS and overlay bit offset BO used are S and O respectively
+    scaled by 8 (the byte size).
+
+7.  ``DW_OP_LLVM_bit_overlay`` *New*
+
+    ``DW_OP_LLVM_bit_overlay`` pops four stack entries. The first must be an
+    integral type value that represents the overlay bit size value BS. The
+    second must be an integral type value that represents the overlay bit offset
+    value BO. The third must be a location description that represents the
+    overlay location description OL. The fourth must be a location description
+    that represents the base location description BL.
+
+    The DWARF expression is ill-formed if BS or BO are negative values.
+
+    *rbss(L)* is the minimum remaining bit storage size of L which is defined as
+    follows. LS is the location storage and LO is the location bit offset
+    specified by a single location description SL of L. The remaining bit
+    storage size RBSS of SL is the bit size of LS minus LO. *rbss(L)* is the
+    minimum RBSS of each single location description SL of L.
+
+    The DWARF expression is ill-formed if *rbss(BL)* is less than BO plus BS.
+
+    If BS is 0, then the operation pushes BL.
+
+    If BO is 0 and BS equals *rbss(BL)*, then the operation pushes OL.
+
+    Otherwise, the operation is equivalent to performing the following steps to
+    push a composite location description.
+
+    *The composite location description is conceptually the base location
+    description BL with the overlay location description OL positioned as an
+    overlay starting at the overlay offset BO and covering overlay bit size BS.*
+
+    1.  If BO is not 0 then push BL followed by performing the ``DW_OP_bit_piece
+        BO`` operation.
+    2.  Push OL followed by performing the ``DW_OP_bit_piece BS`` operation.
+    3.  If *rbss(BL)* is greater than BO plus BS, push BL followed by performing
+        the ``DW_OP_LLVM_bit_offset (BO + BS); DW_OP_bit_piece (rbss(BL) - BO -
+        BS)`` operations.
+    4.  Perform the ``DW_OP_LLVM_piece_end`` operation.
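
The steps above can be sketched in Python as a function that computes the layout of the resulting composite. This is a simplified model under stated assumptions: location descriptions are represented only by the labels ``"BL"`` and ``"OL"``, each part is a hypothetical ``(source, bit offset into source, bit size)`` tuple, and the two cases that push BL or OL directly are modeled as a single part.

```python
# Simplified model; the tuple representation is an assumption, not DWARF.
def bit_overlay_parts(rbss_bl, bo, bs):
    """Return the parts produced by DW_OP_LLVM_bit_overlay for a base location
    with rbss(BL) == rbss_bl, overlay bit offset bo and overlay bit size bs."""
    if bo < 0 or bs < 0:
        raise ValueError("ill-formed: BS or BO is negative")
    if rbss_bl < bo + bs:
        raise ValueError("ill-formed: rbss(BL) is less than BO plus BS")
    if bs == 0:
        return [("BL", 0, rbss_bl)]          # the operation pushes BL
    if bo == 0 and bs == rbss_bl:
        return [("OL", 0, bs)]               # the operation pushes OL
    parts = []
    if bo != 0:
        parts.append(("BL", 0, bo))          # step 1: base before the overlay
    parts.append(("OL", 0, bs))              # step 2: the overlay itself
    if rbss_bl > bo + bs:                    # step 3: base after the overlay
        parts.append(("BL", bo + bs, rbss_bl - bo - bs))
    return parts


def overlay_parts(rbss_bl, o, s):
    """DW_OP_LLVM_overlay: same action with byte offset and size scaled by 8."""
    return bit_overlay_parts(rbss_bl, o * 8, s * 8)


# A 16 element array of 4 byte integers with elements 8..15 in a register.
print(overlay_parts(16 * 4 * 8, o=8 * 4, s=8 * 4))
```

In this example the first 256 bits come from the base location and the last 256 bits from the overlay, so no trailing base part is produced.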
+
 .. _amdgpu-dwarf-location-list-expressions:

 A.2.5.5 DWARF Location List Expressions
@@ -3180,19 +3348,26 @@ A.3.3.5 Low-Level Information

      The following new attributes are added.

-4.  For languages that are implemented using a SIMD or SIMT execution model, a
+4.  For languages that are implemented using a SIMT execution model, a
    ``DW_TAG_subprogram``, ``DW_TAG_inlined_subroutine``, or
    ``DW_TAG_entry_point`` debugger information entry may have a
    ``DW_AT_LLVM_lanes`` attribute whose value is an integer constant that is
-    the number of lanes per thread. This is the static number of lanes per
-    thread. It is not the dynamic number of lanes with which the thread was
-    initiated, for example, due to smaller or partial work-groups.
+    the number of source language threads of execution per target architecture
+    thread.
+
+    *For example, a compiler may map source language threads of execution onto
+    lanes of a target architecture thread using a SIMT execution model.*
+
+    It is the static number of source language threads of execution per target
+    architecture thread. It is not the dynamic number of source language threads
+    of execution with which the target architecture thread was initiated, for
+    example, due to smaller or partial work-groups.

    If not present, the default value of 1 is used.

-    The DWARF is ill-formed if the value is 0.
+    The DWARF is ill-formed if the value is less than or equal to 0.

-5.  For languages that are implemented using a SIMD or SIMT execution model, a
+5.  For source languages that are implemented using a SIMT execution model, a
    ``DW_TAG_subprogram``, ``DW_TAG_inlined_subroutine``, or
    ``DW_TAG_entry_point`` debugging information entry may have a
    ``DW_AT_LLVM_lane_pc`` attribute whose value is a DWARF expression E.
@@ -3203,52 +3378,87 @@ A.3.3.5 Low-Level Information
    elements corresponding to the source language thread of execution upon which
    the user is focused, if any.

-    The resulting location description L is for a thread lane count sized vector
-    of generic type elements. The thread lane count is the value of the
+    The resulting location description L is for a lane count sized vector of
+    generic type elements. The lane count is the value of the
    ``DW_AT_LLVM_lanes`` attribute. Each element holds the conceptual program
-    location of the corresponding lane, where the least significant element
-    corresponds to the first target architecture specific lane identifier and so
-    forth. If the lane was not active when the current subprogram was called,
-    its element is an undefined location description.
+    location of the corresponding lane. If the lane was not active when the
+    current subprogram was called, its element is an undefined location
+    description.
+
+    The DWARF is ill-formed if L does not have exactly one single location
+    description.

    ``DW_AT_LLVM_lane_pc`` *allows the compiler to indicate conceptually where
-    each lane of a SIMT thread is positioned even when it is in divergent
-    control flow that is not active.*
+    each SIMT lane of a target architecture thread is positioned even when it is
+    in divergent control flow that is not active.*

    *Typically, the result is a location description with one composite location
    description with each part being a location description with either one
    undefined location description or one memory location description.*

-    If not present, the thread is not being used in a SIMT manner, and the
-    thread's current program location is used.
+    If not present, the target architecture thread is not being used in a SIMT
+    manner, and the thread's current program location is used.

-6.  For languages that are implemented using a SIMD or SIMT execution model, a
+6.  For languages that are implemented using a SIMT execution model, a
    ``DW_TAG_subprogram``, ``DW_TAG_inlined_subroutine``, or
    ``DW_TAG_entry_point`` debugger information entry may have a
    ``DW_AT_LLVM_active_lane`` attribute whose value is a DWARF expression E.

-    The result of the attribute is obtained by evaluating E with a context that
-    has a result kind of a value, an unspecified object, the compilation unit
-    that contains E, an empty initial stack, and other context elements
-    corresponding to the source language thread of execution upon which the user
-    is focused, if any.
+    E is evaluated with a context that has a result kind of a location
+    description, an unspecified object, the compilation unit that contains E, an
+    empty initial stack, and other context elements corresponding to the source
+    language thread of execution upon which the user is focused, if any.

-    The DWARF is ill-formed if the resulting value V is not an integral value.
+    The DWARF is ill-formed if the resulting location description L does not
+    have exactly one single location description SL.

-    The resulting V is a bit mask of active lanes for the current program
-    location. The N\ :sup:`th` least significant bit of the mask corresponds to
-    the N\ :sup:`th` lane. If the bit is 1 the lane is active, otherwise it is
-    inactive.
+    The active lane bit mask V for the current program location is obtained by
+    reading from SL using a target architecture specific integral base type T
+    that has a bit size equal to the value of the ``DW_AT_LLVM_lanes`` attribute
+    of the subprogram corresponding to context's frame and program location. The
+    N\ :sup:`th` least significant bit of the mask corresponds to the N\
+    :sup:`th` lane. If the bit is 1 the lane is active, otherwise it is
+    inactive. The result of the attribute is the value V.

    *Some targets may update the target architecture execution mask for regions
    of code that must execute with different sets of lanes than the current
    active lanes. For example, some code must execute with all lanes made
    temporarily active.* ``DW_AT_LLVM_active_lane`` *allows the compiler to
-    provide the means to determine the source language active lanes.*
+    provide the means to determine the source language active lanes at any
+    program location. Typically, this attribute will use a loclist to express
+    different locations of the active lane mask at different program locations.*

    If not present and ``DW_AT_LLVM_lanes`` is greater than 1, then the target
    architecture execution mask is used.
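
Decoding the active lane bit mask V can be sketched as follows. This Python fragment assumes lanes are numbered from 0 with lane 0 in the least significant bit, and that V has already been read from SL as an integer; the function name is hypothetical.

```python
# Illustration only; assumes lane 0 maps to the least significant bit of V.
def active_lanes(mask, lane_count):
    """Return the lane numbers whose bit is 1 in the active lane mask.

    `lane_count` stands for the resolved DW_AT_LLVM_lanes value, which is also
    the bit size of the type used to read the mask.
    """
    return [lane for lane in range(lane_count) if (mask >> lane) & 1]


print(active_lanes(0b1011, 4))
```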

+7.  A ``DW_TAG_subprogram``, ``DW_TAG_inlined_subroutine``, or
+    ``DW_TAG_entry_point`` debugger information entry may have a
+    ``DW_AT_LLVM_iterations`` attribute whose value is an integer constant or a
+    DWARF expression E. Its value is the number of source language loop
+    iterations executing concurrently by the target architecture for a single
+    source language thread of execution.
+
+    *A compiler may generate code that executes more than one iteration of a
+    source language loop concurrently using optimization techniques such as
+    software pipelining or SIMD vectorization. The number of concurrent
+    iterations may vary for different loop nests in the same subprogram.
+    Typically, this attribute will use a loclist to express different values at
+    different program locations.*
+
+    If the attribute is an integer constant, then the value is the constant. The
+    DWARF is ill-formed if the constant is less than or equal to 0.
+
+    Otherwise, E is evaluated with a context that has a result kind of a
+    location description, an unspecified object, the compilation unit that
+    contains E, an empty initial stack, and other context elements corresponding
+    to the source language thread of execution upon which the user is focused,
+    if any. The DWARF is ill-formed if the result is not a location description
+    comprised of one implicit location description, or if reading that location
+    description as the generic type results in a value V that is less than or
+    equal to 0. The result of the attribute is the value V.
+
+    If not present, the default value of 1 is used.
+
 A.3.4 Call Site Entries and Parameters
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -3985,8 +4195,9 @@ operations that can be used in E have the following restrictions:
 * ``DW_OP_push_object_address`` is not allowed because there is no object
  context to provide a value to push.

-* ``DW_OP_LLVM_push_lane`` is not allowed because the call frame instructions
-  describe the actions for the whole thread, not the lanes independently.
+* ``DW_OP_LLVM_push_lane`` and ``DW_OP_LLVM_push_iteration`` are not allowed
+  because the call frame instructions describe the actions for the whole target
+  architecture thread, not the lanes or iterations independently.

 * ``DW_OP_call_frame_cfa`` and ``DW_OP_entry_value`` are not allowed because
  their use would be circular.
@@ -4325,6 +4536,7 @@ entry attributes.
   DW_AT_LLVM_lanes                   0x3e0a constant
   DW_AT_LLVM_lane_pc                 0x3e0b exprloc, loclist
   DW_AT_LLVM_vector_size             0x3e0c constant
+   DW_AT_LLVM_iterations              0x3e0a constant, exprloc, loclist
   ================================== ====== ===================================

 .. _amdgpu-dwarf-classes-and-forms:
@@ -4381,6 +4593,9 @@ operations.
                                                     ULEB128 count
   DW_OP_LLVM_select_bit_piece        0xec     2     ULEB128 bit size,
                                                     ULEB128 count
+   DW_OP_LLVM_push_iteration          TBA      0
+   DW_OP_LLVM_overlay                 TBA      0
+   DW_OP_LLVM_bit_overlay             TBA      0
   ================================== ===== ======== ===============================

 A.7.7.3 Location List Expressions
@@ -4496,12 +4711,15 @@ debugger information entries.
   ``DW_TAG_entry_point``        * ``DW_AT_LLVM_active_lane``
                                 * ``DW_AT_LLVM_lane_pc``
                                 * ``DW_AT_LLVM_lanes``
+                                 * ``DW_AT_LLVM_iterations``
   ``DW_TAG_inlined_subroutine`` * ``DW_AT_LLVM_active_lane``
                                 * ``DW_AT_LLVM_lane_pc``
                                 * ``DW_AT_LLVM_lanes``
+                                 * ``DW_AT_LLVM_iterations``
   ``DW_TAG_subprogram``         * ``DW_AT_LLVM_active_lane``
                                 * ``DW_AT_LLVM_lane_pc``
                                 * ``DW_AT_LLVM_lanes``
+                                 * ``DW_AT_LLVM_iterations``
   ============================= =============================

 .. _amdgpu-dwarf-examples: