1.10. Features for GPU Accelerated Computing
In this section, we introduce the SUNDIALS GPU programming model and highlight SUNDIALS GPU features. The model leverages the fact that all of the SUNDIALS packages interact with simulation data either through the shared vector, matrix, and solver APIs or through user-supplied callback functions. Thus, under the model, the overall structure of the user’s calling program and the way users interact with the SUNDIALS packages are similar to using SUNDIALS in CPU-only environments.
1.10.1. SUNDIALS GPU Programming Model
As described in [14], within the SUNDIALS GPU programming model, all control logic executes on the CPU, and all simulation data resides wherever the vector or matrix object dictates for as long as SUNDIALS is in control of the program. That is, SUNDIALS will not explicitly migrate data from one memory space to another. Except in the most advanced use cases, it is safe to assume that data is kept resident in the GPU-device memory space. Consequently, when control passes from the user’s calling program to SUNDIALS, simulation data in vector or matrix objects must be up-to-date in the device memory space. Similarly, when control passes from SUNDIALS back to the user’s calling program, the user should assume that any simulation data in vector and matrix objects is up-to-date in the device memory space.
To put it succinctly, it is the responsibility of the user’s calling program to manage data coherency between the CPU and GPU-device memory spaces unless unified virtual memory (UVM), also known as managed memory, is being utilized. Typically, the GPU-enabled SUNDIALS modules provide functions to copy data from the host to the device and vice versa, as well as support for unmanaged memory or UVM. In practical terms, because SUNDIALS treats host and device memory as distinct spaces, users need to ensure that user-supplied functions, e.g., the right-hand side function, operate only on simulation data in the device memory space; otherwise, extra memory transfers will be required and performance will suffer. The exception to this rule is when some form of hybrid data partitioning (achievable with NVECTOR_MANYVECTOR, see §8.22) is utilized.
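The coherency rule above can be illustrated with a minimal sketch using the CUDA vector with unmanaged memory. It assumes the SUNDIALS v7 API (in v6.x, `SUNContext_Create` takes a `void*` communicator and `sunrealtype` may be spelled `realtype`); the problem size is hypothetical, and the call to `N_VScale` merely stands in for handing control to a SUNDIALS package.

```cuda
#include <cstdio>
#include <nvector/nvector_cuda.h>      // NVECTOR_CUDA constructors and copy routines
#include <sundials/sundials_context.h> // SUNContext

int main()
{
  // Assumption: SUNDIALS v7 signature; in v6.x pass NULL as the communicator.
  SUNContext sunctx = nullptr;
  if (SUNContext_Create(SUN_COMM_NULL, &sunctx) != 0) { return 1; }

  const sunindextype n = 1000;         // hypothetical problem size
  N_Vector y = N_VNew_Cuda(n, sunctx); // unmanaged: separate host and device arrays
  if (y == nullptr)
  {
    SUNContext_Free(&sunctx);
    return 1;
  }

  // Load the initial data into the host array of the vector.
  sunrealtype* y_host = N_VGetHostArrayPointer_Cuda(y);
  for (sunindextype i = 0; i < n; ++i) { y_host[i] = 1.0; }

  // Make the device copy current before SUNDIALS takes control.
  N_VCopyToDevice_Cuda(y);

  // Stand-in for "control passes to SUNDIALS": a vector operation that runs
  // on the device. A real program would call an integrator or solver here.
  N_VScale(2.0, y, y);

  // Control is back in user code. The result lives in device memory, so it
  // must be copied to the host before it is read there.
  N_VCopyFromDevice_Cuda(y);
  std::printf("y[0] = %g\n", static_cast<double>(y_host[0]));

  N_VDestroy(y);
  SUNContext_Free(&sunctx);
  return 0;
}
```

Omitting either copy would not cause an error; the host and device arrays would simply hold different values, which is why coherency is the calling program's responsibility.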
SUNDIALS provides many native shared features and modules that are GPU-enabled. Currently, the supported GPU programming environments are the NVIDIA CUDA platform [5], AMD ROCm/HIP [2], and Intel oneAPI [3]. Table 1.4–Table 1.7 summarize the shared SUNDIALS modules that are GPU-enabled, which GPU programming environments they support, and which class of memory they support (unmanaged or UVM). Users may also supply their own GPU-enabled N_Vector, SUNMatrix, SUNLinearSolver, or SUNNonlinearSolver implementation, and its capabilities will be leveraged since SUNDIALS operates on data through these APIs.
In addition, SUNDIALS provides a memory management helper module (see §13) to support applications which implement their own memory management or memory pooling.
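As a point of comparison between the two memory classes (unmanaged and UVM), the following sketch creates a CUDA vector backed by managed memory, in which case no explicit host/device copies are written. It assumes the SUNDIALS v7 API and a device that supports managed memory; the problem size is hypothetical, and the explicit `cudaDeviceSynchronize` is a conservative choice to ensure device work has finished before the host reads the data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvector/nvector_cuda.h>      // N_VNewManaged_Cuda
#include <sundials/sundials_context.h> // SUNContext

int main()
{
  // Assumption: SUNDIALS v7 signature; in v6.x pass NULL as the communicator.
  SUNContext sunctx = nullptr;
  if (SUNContext_Create(SUN_COMM_NULL, &sunctx) != 0) { return 1; }

  const sunindextype n = 1000;                // hypothetical problem size
  N_Vector y = N_VNewManaged_Cuda(n, sunctx); // data allocated as UVM
  if (y == nullptr)
  {
    SUNContext_Free(&sunctx);
    return 1;
  }

  // With managed memory the host can write the vector data directly;
  // no N_VCopyToDevice_Cuda call follows.
  sunrealtype* y_data = N_VGetHostArrayPointer_Cuda(y);
  for (sunindextype i = 0; i < n; ++i) { y_data[i] = 1.0; }

  // A vector operation executed on the device.
  N_VScale(2.0, y, y);

  // Conservative: wait for outstanding device work before the host touches
  // the managed allocation again.
  if (cudaDeviceSynchronize() != cudaSuccess)
  {
    N_VDestroy(y);
    SUNContext_Free(&sunctx);
    return 1;
  }
  std::printf("y[0] = %g\n", static_cast<double>(y_data[0]));

  N_VDestroy(y);
  SUNContext_Free(&sunctx);
  return 0;
}
```

UVM trades explicit copies for page migration managed by the driver, so it is usually the more convenient option, while unmanaged memory gives the application full control over when transfers occur.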
Module |
CUDA |
ROCm/HIP |
oneAPI |
Unmanaged Memory |
UVM |
---|---|---|---|---|---|
X |
X |
X |
|||
X |
X |
X |
X |
||
X3 |
X3 |
X |
X |
X |
|
X |
X |
X |
X |
X |
|
X |
X |
X |
X |
X |
|
X |
X2 |
X2 |
X |
Module |
CUDA |
ROCm/HIP |
oneAPI |
Unmanaged Memory |
UVM |
---|---|---|---|---|---|
X |
X |
X |
|||
X3 |
X3 |
X |
X |
X |
|
X |
X |
X |
X |
||
X |
X |
X |
X |
||
X |
X |
X |
X |
Module |
CUDA |
ROCm/HIP |
oneAPI |
Unmanaged Memory |
UVM |
---|---|---|---|---|---|
X |
X |
X |
|||
X3 |
X3 |
X |
X |
X |
|
X |
X |
X |
|||
X |
X |
X |
X |
||
X |
X |
X |
X |
||
X1 |
X1 |
X1 |
X1 |
X1 |
|
X1 |
X1 |
X1 |
X1 |
X1 |
|
X1 |
X1 |
X1 |
X1 |
X1 |
|
X1 |
X1 |
X1 |
X1 |
X1 |
|
X1 |
X1 |
X1 |
X1 |
X1 |
Module |
CUDA |
ROCm/HIP |
oneAPI |
Unmanaged Memory |
UVM |
---|---|---|---|---|---|
X1 |
X1 |
X1 |
X1 |
X1 |
|
X1 |
X1 |
X1 |
X1 |
X1 |
Notes regarding the above tables:
1. This module inherits support from the NVECTOR module used.
2. Support for ROCm/HIP and oneAPI is currently untested.
3. Support for CUDA and ROCm/HIP is currently untested.
In addition, note that implicit UVM (i.e. malloc returning UVM) is not accounted for.
1.10.2. Steps for Using GPU Accelerated SUNDIALS
For any SUNDIALS package, the generalized steps a user needs to take to use GPU accelerated SUNDIALS are:
1. Utilize a GPU-enabled N_Vector implementation. Initial data can be loaded on the host, but must be in the device memory space prior to handing control to SUNDIALS.
2. Utilize a GPU-enabled SUNLinearSolver linear solver (if applicable).
3. Utilize a GPU-enabled SUNMatrix implementation (if using a matrix-based linear solver).
4. Utilize a GPU-enabled SUNNonlinearSolver nonlinear solver (if applicable).
5. Write user-supplied functions so that they use data only in the device memory space (again, unless an atypical data partitioning is used). A few examples of these functions are the right-hand side evaluation function, the Jacobian evaluation function, and the preconditioner evaluation function. In the context of CUDA and the right-hand side function, one way to ensure data is accessed on the device is to have a CPU function simply extract the underlying device data array from the N_Vector object that SUNDIALS passes to the user-supplied function, and then call a CUDA kernel that does all of the computation.
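The last step can be sketched as follows for CUDA. This is an illustration under stated assumptions rather than a reference implementation: the model problem (dy/dt = -lambda * y), the names `rhs_kernel` and `rhs`, and the launch configuration are hypothetical; the callback uses the right-hand side signature of CVODE and ARKODE with SUNDIALS v7 type names; and the vectors are assumed to use the default CUDA stream. The small `main` calls the callback directly, in place of an integrator, so that the sketch is complete.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvector/nvector_cuda.h>
#include <sundials/sundials_context.h>

// Hypothetical model problem: dy_i/dt = -lambda * y_i for each component.
__global__ void rhs_kernel(sunindextype n, sunrealtype lambda,
                           const sunrealtype* y, sunrealtype* ydot)
{
  const sunindextype i =
    static_cast<sunindextype>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i < n) { ydot[i] = -lambda * y[i]; }
}

// CPU function with the CVODE/ARKODE right-hand side signature. It only
// extracts device pointers and launches the kernel; no data moves to the host.
int rhs(sunrealtype t, N_Vector y, N_Vector ydot, void* user_data)
{
  (void)t; // the model problem is autonomous
  const sunrealtype lambda = *static_cast<sunrealtype*>(user_data);
  const sunindextype n     = N_VGetLength(y);

  const sunrealtype* y_dev = N_VGetDeviceArrayPointer_Cuda(y);
  sunrealtype* ydot_dev    = N_VGetDeviceArrayPointer_Cuda(ydot);

  const unsigned int block = 256;
  const unsigned int grid  = static_cast<unsigned int>((n + block - 1) / block);
  rhs_kernel<<<grid, block>>>(n, lambda, y_dev, ydot_dev);

  // Conservative: wait for the kernel, then report any failure to the
  // caller as an unrecoverable error (negative return value).
  cudaDeviceSynchronize();
  if (cudaGetLastError() != cudaSuccess) { return -1; }
  return 0;
}

int main()
{
  // Assumption: SUNDIALS v7 signature; in v6.x pass NULL as the communicator.
  SUNContext sunctx = nullptr;
  if (SUNContext_Create(SUN_COMM_NULL, &sunctx) != 0) { return 1; }

  const sunindextype n = 1000; // hypothetical problem size
  N_Vector y    = N_VNew_Cuda(n, sunctx);
  N_Vector ydot = N_VNew_Cuda(n, sunctx);
  if (y == nullptr || ydot == nullptr)
  {
    N_VDestroy(y);    // N_VDestroy ignores NULL vectors
    N_VDestroy(ydot);
    SUNContext_Free(&sunctx);
    return 1;
  }

  // Set y on the device, evaluate the right-hand side, and copy the result
  // back only because this driver inspects it on the host.
  N_VConst(1.0, y);
  sunrealtype lambda = 0.5;
  const int flag = rhs(0.0, y, ydot, &lambda);
  N_VCopyFromDevice_Cuda(ydot);
  std::printf("flag = %d, ydot[0] = %g\n", flag,
              static_cast<double>(N_VGetHostArrayPointer_Cuda(ydot)[0]));

  N_VDestroy(y);
  N_VDestroy(ydot);
  SUNContext_Free(&sunctx);
  return flag;
}
```

In an application, `rhs` would be passed to the integrator's initialization routine and `lambda` registered as user data; if the vectors were created with a non-default stream, the kernel would need to be launched on that same stream.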
Users should refer to the above tables for a complete list of GPU-enabled native SUNDIALS modules.