.. _amdgpu-async-operations:

===============================
 AMDGPU Asynchronous Operations
===============================

.. contents::
   :local:

Introduction
============

Asynchronous operations are memory transfers (usually between global memory
and LDS) that complete independently of the requesting thread, at an
unspecified scope. A thread that requests one or more asynchronous transfers
can use *asyncmarks* to track their completion. The thread waits for an
asyncmark to be *completed*, which indicates that all requests initiated in
*program-order* before that asyncmark have also completed.

Operations
==========

Memory Accesses
---------------

The following instructions request asynchronous transfer of data between global
memory and LDS memory.

.. note::

   These listings are *merely representative*. The actual function signatures
   and supported architectures are documented in the :ref:`amdgpu-usage-guide`.

**GFX9 Async Instructions (LDS DMA)**

.. code-block:: llvm

  void @llvm.amdgcn.load.async.to.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.global.load.async.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.raw.buffer.load.async.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.raw.ptr.buffer.load.async.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.struct.buffer.load.async.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.struct.ptr.buffer.load.async.lds(ptr %src, ptr %dst)

**GFX12 Async Instructions**

.. code-block:: llvm

  void @llvm.amdgcn.global.load.async.to.lds.type(ptr %dst, ptr %src)
  void @llvm.amdgcn.global.store.async.from.lds.type(ptr %dst, ptr %src)
  void @llvm.amdgcn.cluster.load.async.to.lds.type(ptr %dst, ptr %src)

**GFX1250 Tensor DMA Instructions**

.. code-block:: llvm

  void @llvm.amdgcn.tensor.load.to.lds(...)
  void @llvm.amdgcn.tensor.store.from.lds(...)

Asyncmark Operations
--------------------

An *asyncmark* in the abstract machine tracks all the async operations that
are *program-ordered* before that asyncmark. An asyncmark M is said to be *completed*
only when all async operations *program-ordered* before M are reported by the
implementation as having finished, and it is said to be *outstanding* otherwise.

Thus we have the following sufficient condition:

  An async operation X is *completed* at a program point P if there exists an
  asyncmark M such that X is *program-ordered* before M, M is *program-ordered* before
  P, and M is completed. X is said to be *outstanding* at P otherwise.

The abstract machine maintains a sequence of asyncmarks during the execution
of each function body. This sequence excludes any asyncmarks produced by calls
to other functions encountered in the currently executing function.

``@llvm.amdgcn.asyncmark()``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When executed, inserts an asyncmark in the sequence associated with the
currently executing function body.

``@llvm.amdgcn.wait.asyncmark(i16 %N)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Waits until there are at most N outstanding asyncmarks in the sequence associated
with the currently executing function body.
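
The bookkeeping described above can be sketched as a small host-side model.
This is an illustration only, not a description of how any compiler or
hardware implements asyncmarks: ``AsyncMarkModel``, its members and
``uneven_blocks`` are hypothetical names, and the model records only what a
wait guarantees (a real transfer may finish earlier).

.. code-block:: c++

   #include <cassert>
   #include <cstddef>
   #include <deque>

   // Hypothetical host-side model of the asyncmark sequence of one function
   // body. It records only what a wait guarantees.
   struct AsyncMarkModel {
     std::size_t issued = 0;               // async operations requested so far
     std::size_t guaranteed = 0;           // operations known to be completed
     std::deque<std::size_t> outstanding;  // per mark: `issued` at that mark

     // Models a request for one asynchronous transfer.
     void async_op() { ++issued; }

     // Models @llvm.amdgcn.asyncmark(): the new mark covers every operation
     // requested before it.
     void asyncmark() { outstanding.push_back(issued); }

     // Models @llvm.amdgcn.wait.asyncmark(n): the oldest marks are completed
     // until at most n marks remain outstanding.
     void wait_asyncmark(std::size_t n) {
       while (outstanding.size() > n) {
         guaranteed = outstanding.front();
         outstanding.pop_front();
       }
       assert(outstanding.size() <= n);
     }
   };

   // Requests blocks of 3, 5 and 2 transfers, each closed by an asyncmark,
   // then waits with count n. Returns how many transfers are guaranteed to
   // be completed after the wait.
   std::size_t uneven_blocks(std::size_t n) {
     AsyncMarkModel m;
     const std::size_t block_sizes[] = {3, 5, 2};
     for (std::size_t size : block_sizes) {
       for (std::size_t i = 0; i < size; ++i)
         m.async_op();
       m.asyncmark();
     }
     m.wait_asyncmark(n);
     return m.guaranteed;
   }

In this model, ``uneven_blocks(2)`` returns 3 because only the first block is
guaranteed to be completed, while ``uneven_blocks(0)`` returns 10. The wait
count refers to asyncmarks, not to individual transfers, so the uneven block
sizes do not affect the choice of count.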

Memory Consistency Model
========================

Each asynchronous operation consists of a non-atomic read on the source and a
non-atomic write on the destination. Async "LDS DMA" intrinsics result in async
accesses that guarantee visibility relative to other memory operations as
follows:

  An asynchronous operation `A` that is *program-ordered* before an overlapping
  memory operation `X` happens-before `X` only if `A` is completed before `X`.

  A memory operation `X` that is *program-ordered* before an overlapping
  asynchronous operation `A` happens-before `A`.

.. note::

   The *only if* in the above wording implies that unlike the default LLVM
   memory model, certain program order edges are not automatically included in
   ``happens-before``.
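
The first rule can be illustrated with a toy host-side model in which every
transfer is a deferred copy that is performed when a wait completes the
asyncmark covering it. This is a sketch under strong assumptions:
``ToyAsyncCopies``, ``Observed`` and ``read_around_wait`` are hypothetical
names, plain ``int`` variables stand in for global memory and LDS, and the
stale value seen by the early read is only one possible outcome. On real
hardware that read is not ordered with the asynchronous write.

.. code-block:: c++

   #include <cassert>
   #include <cstddef>
   #include <deque>
   #include <utility>
   #include <vector>

   // Hypothetical toy model: a transfer takes effect only when a wait
   // completes the asyncmark that covers it.
   struct ToyAsyncCopies {
     struct Copy { const int *src; int *dst; };
     std::vector<Copy> unmarked;           // requested since the last asyncmark
     std::deque<std::vector<Copy>> marks;  // outstanding asyncmarks, oldest first

     void async_load_to_lds(int *dst, const int *src) {
       unmarked.push_back(Copy{src, dst});
     }

     void asyncmark() {
       marks.push_back(std::move(unmarked));
       unmarked.clear();
     }

     void wait_asyncmark(std::size_t n) {
       while (marks.size() > n) {
         for (const Copy &c : marks.front())
           *c.dst = *c.src;  // the transfer completes here in this toy model
         marks.pop_front();
       }
       assert(marks.size() <= n);
     }
   };

   struct Observed { int before_wait; int after_wait; };

   Observed read_around_wait() {
     int global_value = 42;  // stands in for global memory
     int lds_value = 0;      // stands in for LDS
     ToyAsyncCopies m;
     m.async_load_to_lds(&lds_value, &global_value);
     m.asyncmark();
     int before = lds_value;  // no happens-before edge from the transfer
     m.wait_asyncmark(0);
     int after = lds_value;   // the transfer is completed before this read
     return Observed{before, after};
   }

Only the read placed after the wait is guaranteed to observe the transferred
value. The read placed before the wait must not be relied upon, even though
it follows the request in program order.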

Examples
========

Uneven blocks of async transfers
--------------------------------

.. code-block:: c++

   void foo(global int *g, local int *l) {
     // first block
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // second block; longer
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // third block; shorter
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // Wait for first block
     wait.asyncmark(2);
   }

Software pipeline
-----------------

.. code-block:: c++

   void foo(global int *g, local int *l) {
     // first block
     asyncmark();

     // second block
     asyncmark();

     // third block
     asyncmark();

     for (;;) {
       wait.asyncmark(2);
       // use data

       // next block
       asyncmark();
     }

     // flush one block
     wait.asyncmark(2);

     // flush one more block
     wait.asyncmark(1);

     // flush last block
     wait.asyncmark(0);
   }
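
The steady state of this pipeline can be checked with a counter-based model.
The sketch below is an assumption-laden illustration: ``MarkCounter`` and
``run_pipeline`` are hypothetical names, the loop runs for a finite number of
iterations instead of ``for (;;)``, and the model counts only the asyncmarks
that a wait guarantees to be completed.

.. code-block:: c++

   #include <cassert>
   #include <cstddef>
   #include <vector>

   // Hypothetical model of one asyncmark sequence as a pair of counters.
   struct MarkCounter {
     std::size_t inserted = 0;   // asyncmarks inserted so far
     std::size_t completed = 0;  // asyncmarks known to be completed

     void asyncmark() { ++inserted; }

     // Completes the oldest marks until at most n remain outstanding.
     void wait_asyncmark(std::size_t n) {
       if (inserted - completed > n)
         completed = inserted - n;
       assert(inserted - completed <= n);
     }
   };

   // Replays the pipeline and records, after each wait, how many blocks are
   // known to be completed.
   std::vector<std::size_t> run_pipeline(std::size_t iterations) {
     MarkCounter m;
     std::vector<std::size_t> completed_at_use;

     for (int i = 0; i < 3; ++i)
       m.asyncmark();  // prologue: three blocks in flight

     for (std::size_t i = 0; i < iterations; ++i) {
       m.wait_asyncmark(2);
       completed_at_use.push_back(m.completed);  // "use data"
       m.asyncmark();                            // next block
     }

     const std::size_t drain[] = {2, 1, 0};  // epilogue: flush remaining blocks
     for (std::size_t n : drain) {
       m.wait_asyncmark(n);
       completed_at_use.push_back(m.completed);
     }
     return completed_at_use;
   }

For two loop iterations the recorded counts are 1, 2, 3, 4, 5. Each wait
releases exactly one more block, and at most three asyncmarks are outstanding
at any point.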

Ordinary function call
----------------------

.. code-block:: c++

   extern void bar(); // may or may not make async calls

   void foo(global int *g, local int *l) {
       // first block
       asyncmark();

       // second block
       asyncmark();

       // function call
       bar();

       // third block
       asyncmark();

       wait.asyncmark(1); // wait for the first and second blocks
       wait.asyncmark(0); // wait for the third block, including async operations in bar()
   }

Implementation notes
====================

[This section is informational.]

Optimization
------------

The implementation may eliminate asyncmark/wait intrinsics in the following cases:

1. An ``asyncmark`` operation which is not included in the wait count of a later
   wait operation in the current function. In particular, an ``asyncmark`` which
   is not post-dominated by any ``wait.asyncmark``.
2. A ``wait.asyncmark`` whose wait count is more than the number of
   outstanding asyncmarks at that point. In particular, a ``wait.asyncmark``
   that is not dominated by any ``asyncmark``.

In general, at a function call, if the caller uses sufficient waits to track
its own async operations, the actions performed by the callee cannot affect
correctness. But inlining such a call may result in redundant waits.

.. code-block:: c++

   void foo() {
     asyncmark(); // A
   }

   void bar() {
     asyncmark(); // B
     asyncmark(); // C
     foo();
     wait.asyncmark(1);
   }

Before inlining, the ``wait.asyncmark`` waits for asyncmark B to be completed.

.. code-block:: c++

   void foo() {
   }

   void bar() {
     asyncmark(); // B
     asyncmark(); // C
     asyncmark(); // A from call to foo()
     wait.asyncmark(1);
   }

After inlining, the ``wait.asyncmark`` now waits for asyncmark C to complete,
which is longer than necessary. Ideally, the optimizer should have eliminated
asyncmark A in the body of ``foo()`` itself.
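
The difference between the two versions of ``bar()`` can be reproduced with a
small helper that lists which asyncmarks a wait requires to be completed.
This is a sketch only: ``must_complete`` is a hypothetical name, and the
sequence passed to it is assumed to contain just the asyncmarks of the
currently executing function body, oldest first.

.. code-block:: c++

   #include <algorithm>
   #include <cassert>
   #include <cstddef>
   #include <string>
   #include <vector>

   // Hypothetical helper: given the asyncmarks (oldest first) in the current
   // function's sequence, returns those that wait.asyncmark(n) requires to
   // be completed.
   std::vector<std::string> must_complete(
       const std::vector<std::string> &sequence, std::size_t n) {
     // Up to n of the newest marks may remain outstanding.
     std::size_t kept = std::min(sequence.size(), n);
     std::size_t count = sequence.size() - kept;
     assert(count <= sequence.size());
     return std::vector<std::string>(
         sequence.begin(),
         sequence.begin() + static_cast<std::ptrdiff_t>(count));
   }

Before inlining, the sequence in ``bar()`` is B, C and
``must_complete({"B", "C"}, 1)`` yields only B. After inlining, the sequence
is B, C, A and ``must_complete({"B", "C", "A"}, 1)`` yields B and C, which is
the longer wait described above.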
