Overview

The indirect clause enables indirect device invocation for a procedure:

An indirect call to the device version of a procedure on a device other than the host
device, through a function pointer (C/C++), a pointer to a member function (C++) or
a procedure pointer (Fortran) that refers to the host version of the procedure.
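
As a minimal illustration of the situation the clause addresses, the sketch below passes a host function pointer into a target region and calls through it. The function names are hypothetical, and the directive spelling assumes a compiler that implements the OpenMP 5.1 indirect clause. When the code is built without OpenMP support the pragmas are ignored and the call simply runs on the host.

```cpp
// Hypothetical example functions; only the directive and the clause come
// from OpenMP itself.
#pragma omp begin declare target indirect
int add_one(int x) { return x + 1; }
int twice(int x) { return 2 * x; }
#pragma omp end declare target

// Calls fn inside a target region. fn holds the address of the host version
// of the procedure; the indirect clause is what allows the device to invoke
// the corresponding device version through that pointer.
int apply_on_device(int (*fn)(int), int arg) {
  int result = 0;
#pragma omp target firstprivate(fn, arg) map(from : result)
  { result = fn(arg); }
  return result;
}
```

A caller would write, for example, apply_on_device(add_one, 41), taking the address of add_one on the host as usual.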

Compiler support

Offload entry metadata (C++ FE)

For each function declared as declare target indirect, the C++ FE generates the following offload metadata:

// Entry 0 -> Kind of this type of metadata (2)
// Entry 1 -> Mangled name of the function.
// Entry 2 -> Order the entry was created.

The offloading metadata uses the new OffloadEntriesInfoManagerTy::OffloadingEntryInfoKinds::OffloadingEntryInfoDeviceIndirectFunc metadata kind.

Offload entries table

The offload entries table that is created for the host and for each of the device images currently has entries for declare target global variables, omp target outlined functions, and constructor/destructor thunks for declare target global variables.

The compiler also produces an entry for each procedure listed in the indirect clause of a declare target construct:

struct __tgt_offload_entry {
  void *addr;       // Pointer to the function
  char *name;       // Name of the function
  size_t size;      // 0 for function
  int32_t flags;    // OpenMPOffloadingDeclareTargetFlags::OMP_DECLARE_TARGET_FPTR
  int32_t reserved; // Reserved
};

Run-time dispatch in device code

When the FE generates an indirect function call in device code, it translates the original function pointer (which may be the address of a host function) into a device function pointer using a translation API, and uses the resulting pointer for the call.

Original call code:

  %0 = load void ()*, void ()** %fptr.addr
  call void %0()

Becomes this:

  %0 = load void ()*, void ()** %fptr.addr
  %1 = bitcast void ()* %0 to i8*
  %call = call i8* @__kmpc_target_translate_fptr(i8* %1)
  %fptr_device = bitcast i8* %call to void ()*
  call void %fptr_device()

Device RTLs must provide the translation API:

// Translate \p FnPtr identifying a host function into a function pointer
// identifying its device counterpart.
// If \p FnPtr matches an address of any host function
// declared as 'declare target indirect', then the API
// must return an address of the same function compiled
// for the device. If \p FnPtr does not match an address
// of any host function, then the API returns \p FnPtr
// unchanged.
EXTERN void *__kmpc_target_translate_fptr(void *FnPtr);
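
The rewritten call sequence can also be read as ordinary C++. The sketch below is only an illustration: translate_fptr_stub is a hypothetical stand-in for __kmpc_target_translate_fptr that returns its argument unchanged (the documented fallback for pointers that match no host function), and bump is a hypothetical callee used to show that the call goes through.

```cpp
using VoidFn = void (*)();

// Hypothetical stand-in for __kmpc_target_translate_fptr. With no table to
// consult, it behaves like the documented fallback and returns FnPtr as is.
static void *translate_fptr_stub(void *FnPtr) { return FnPtr; }

// Hypothetical callee that counts how often it was invoked.
static int g_calls = 0;
static void bump() { ++g_calls; }

// C++ rendering of the rewritten IR: load, translate, cast back, call.
void call_indirect(VoidFn *fptr_addr) {
  VoidFn original = *fptr_addr;                             // %0 = load
  void *raw = reinterpret_cast<void *>(original);           // bitcast to i8*
  void *translated = translate_fptr_stub(raw);              // translation API
  VoidFn fptr_device = reinterpret_cast<VoidFn>(translated); // bitcast back
  fptr_device();                                            // call
}
```

Converting between function pointers and void * is only conditionally supported in C++. The sketch assumes a platform where it works, as the i8* signature of the translation API already does.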

Runtime handling of function pointers

OpenMPOffloadingDeclareTargetFlags::OMP_DECLARE_TARGET_FPTR is a new flag that distinguishes offload entries for function pointers from other function entries. Unlike for other function entries (which also have size equal to 0), omptarget::InitLibrary() establishes a mapping for function pointer entries in Device.HostDataToTargetMap.

For each OMP_DECLARE_TARGET_FPTR entry in the offload entries table, libomptarget creates an entry of the following type:

struct __omp_offloading_fptr_map_ty {
  int64_t host_ptr; // key
  int64_t tgt_ptr;  // value
};

Where host_ptr is __tgt_offload_entry::addr in a host offload entry, and tgt_ptr is __tgt_offload_entry::addr in the corresponding device offload entry (which may be found using the populated Device.HostDataToTargetMap).

Once all __omp_offloading_fptr_map_ty entries are collected in a single host array, libomptarget sorts the table by host_ptr and passes it to the device plugin for registration, provided the plugin supports the optional __tgt_rtl_set_function_ptr_map API.
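
The collect-and-sort step might look roughly like the following sketch. FptrMapEntry mirrors the struct above, HostToTargetMap is a hypothetical stand-in for lookups through Device.HostDataToTargetMap, and skipping addresses with no known device counterpart is an assumption of this sketch rather than documented libomptarget behavior.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Mirrors the host_ptr/tgt_ptr pair described in the text.
struct FptrMapEntry {
  int64_t host_ptr; // key
  int64_t tgt_ptr;  // value
};

// Hypothetical stand-in for Device.HostDataToTargetMap lookups:
// host address -> device address.
using HostToTargetMap = std::unordered_map<int64_t, int64_t>;

// Builds one entry per host function address and sorts the result by
// host_ptr so that the device side can use binary search.
std::vector<FptrMapEntry>
build_fptr_table(const std::vector<int64_t> &host_addrs,
                 const HostToTargetMap &mapping) {
  std::vector<FptrMapEntry> table;
  table.reserve(host_addrs.size());
  for (int64_t host : host_addrs) {
    auto it = mapping.find(host);
    if (it == mapping.end())
      continue; // Assumption: entries without a device address are skipped.
    table.push_back({host, it->second});
  }
  std::sort(table.begin(), table.end(),
            [](const FptrMapEntry &a, const FptrMapEntry &b) {
              return a.host_ptr < b.host_ptr;
            });
  return table;
}
```

Sorting on the host keeps the device-side lookup simple: the device only needs the table pointer and its length.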

Plugins that want to support declare target indirect functionality may provide the following API:

// Register, in a target implementation-defined way, a table
// of __omp_offloading_fptr_map_ty entries providing the
// mapping between host and device addresses of 'declare target indirect'
// functions. \p table_size is the number of elements in the
// \p table_host_ptr array.
EXTERN void __tgt_rtl_set_function_ptr_map(
    int32_t device_id, uint64_t table_size, __omp_offloading_fptr_map_ty *table_host_ptr);

Sample implementation

This section describes one possible implementation.

When compiling a translation module that contains declare target indirect for a device, a FE may define the following global symbols for that module:

// Mapping between host and device functions declared as
// 'declare target indirect'.
__attribute__((weak)) struct __omp_offloading_fptr_map_ty {
  int64_t host_ptr; // key
  int64_t tgt_ptr;  // value
} *__omp_offloading_fptr_map_p = 0;

// Number of elements in __omp_offloading_fptr_map_p table.
__attribute__((weak)) uint64_t __omp_offloading_fptr_map_size = 0;

__tgt_rtl_set_function_ptr_map(int32_t device_id, uint64_t table_size, __omp_offloading_fptr_map_ty *table_host_ptr) allocates device memory of size sizeof(__omp_offloading_fptr_map_ty) * table_size and transfers the contents of the table_host_ptr array into it. The address of the allocated device memory is then assigned to the __omp_offloading_fptr_map_p global variable on the device. For example, in CUDA, the device address of __omp_offloading_fptr_map_p may be obtained by calling cuModuleGetGlobal, and a pointer-sized data transfer then initializes __omp_offloading_fptr_map_p to point to the device copy of the table_host_ptr array. __omp_offloading_fptr_map_size is set to table_size in the same way.

An alternative implementation of __tgt_rtl_set_function_ptr_map may invoke a device kernel that will do the assignments.
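
The allocate, copy, and patch steps of the first approach can be sketched without any device at all. In the simulation below, host memory stands in for device memory, FakeDeviceImage is a hypothetical stand-in for the two weak globals in a device image, and set_function_ptr_map_sim is a hypothetical analogue of __tgt_rtl_set_function_ptr_map. A real CUDA plugin would presumably use driver calls such as cuMemAlloc, cuMemcpyHtoD, and cuModuleGetGlobal in place of malloc, memcpy, and the direct assignments.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Mirrors the host_ptr/tgt_ptr pair described in the text.
struct FptrMapEntry {
  int64_t host_ptr; // key
  int64_t tgt_ptr;  // value
};

// Simulated device image holding the two globals the FE is assumed to emit.
struct FakeDeviceImage {
  FptrMapEntry *fptr_map_p = nullptr; // stands in for __omp_offloading_fptr_map_p
  uint64_t fptr_map_size = 0;         // stands in for __omp_offloading_fptr_map_size
};

// Hypothetical analogue of __tgt_rtl_set_function_ptr_map for the simulation.
void set_function_ptr_map_sim(FakeDeviceImage &image, uint64_t table_size,
                              const FptrMapEntry *table_host_ptr) {
  const size_t bytes = sizeof(FptrMapEntry) * table_size;
  // 1. Allocate "device" memory for the table.
  auto *device_table = static_cast<FptrMapEntry *>(std::malloc(bytes));
  if (table_size != 0 && device_table == nullptr)
    return; // Allocation failed; leave the globals untouched.
  // 2. Transfer the host table into the "device" copy.
  if (bytes != 0)
    std::memcpy(device_table, table_host_ptr, bytes);
  // 3. Patch the globals so device code can find the table and its length.
  image.fptr_map_p = device_table;
  image.fptr_map_size = table_size;
}
```

The sketch never frees the table, on the assumption that it lives as long as the device image does.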

__kmpc_target_translate_fptr(void *FnPtr) uses binary search to match FnPtr against the host_ptr keys in the device table pointed to by __omp_offloading_fptr_map_p. If a matching key is found, it returns the corresponding tgt_ptr; otherwise, it returns FnPtr unchanged.
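
A minimal sketch of such a lookup is shown below. The names are hypothetical: fptr_map_p and fptr_map_size stand in for the device globals, and translate_fptr_sketch stands in for __kmpc_target_translate_fptr. The sketch assumes the table is sorted by host_ptr compared as a signed 64-bit value.

```cpp
#include <cstdint>

// Mirrors the host_ptr/tgt_ptr pair described in the text.
struct FptrMapEntry {
  int64_t host_ptr; // key
  int64_t tgt_ptr;  // value
};

// Stand-ins for __omp_offloading_fptr_map_p and __omp_offloading_fptr_map_size.
static FptrMapEntry *fptr_map_p = nullptr;
static uint64_t fptr_map_size = 0;

// Returns the device pointer recorded for FnPtr, or FnPtr itself when the
// table has no entry with a matching host_ptr.
void *translate_fptr_sketch(void *FnPtr) {
  const int64_t key = static_cast<int64_t>(reinterpret_cast<intptr_t>(FnPtr));
  uint64_t lo = 0, hi = fptr_map_size; // search the half-open range [lo, hi)
  while (lo < hi) {
    const uint64_t mid = lo + (hi - lo) / 2;
    const int64_t candidate = fptr_map_p[mid].host_ptr;
    if (candidate == key)
      return reinterpret_cast<void *>(
          static_cast<intptr_t>(fptr_map_p[mid].tgt_ptr));
    if (candidate < key)
      lo = mid + 1;
    else
      hi = mid;
  }
  return FnPtr; // Not a registered host function; pass it through.
}
```

With an empty table the loop body never runs, so every pointer is returned unchanged. This matches the zero-initialized weak globals above.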

TODO: Optimization for non-unified_shared_memory

If a program does not use requires unified_shared_memory, and all function pointers are mapped (which the OpenMP specification does not require), then an implementation may omit the runtime dispatch code for indirect function calls: neither __kmpc_target_translate_fptr nor __tgt_rtl_set_function_ptr_map is needed. libomptarget will simply map the function pointers as regular data pointers via Device.HostDataToTargetMap.