1.10. Features for GPU Accelerated Computing

In this section, we introduce the SUNDIALS GPU programming model and highlight SUNDIALS GPU features. The model leverages the fact that all of the SUNDIALS packages interact with simulation data either through the shared vector, matrix, and solver APIs (see Chapters §8, §9, §10, and §11) or through user-supplied callback functions. Thus, under this model, the overall structure of the user's calling program and the way users interact with the SUNDIALS packages are similar to using SUNDIALS in CPU-only environments.

1.10.1. SUNDIALS GPU Programming Model

As described in [14], within the SUNDIALS GPU programming model, all control logic executes on the CPU and, while SUNDIALS is in control of the program, all simulation data resides wherever the vector or matrix object dictates. That is, SUNDIALS will not (explicitly) migrate data from one memory space to another. Except in the most advanced use cases, it is safe to assume that data is kept resident in the GPU-device memory space.

The consequence of this is that, when control is passed from the user's calling program to SUNDIALS, simulation data in vector or matrix objects must be up-to-date in the device memory space. Similarly, when control is passed from SUNDIALS back to the user's calling program, the user should assume that any simulation data in vector and matrix objects are up-to-date in the device memory space. Put succinctly, it is the responsibility of the user's calling program to manage data coherency between the CPU and GPU-device memory spaces unless unified virtual memory (UVM), also known as managed memory, is being utilized. Typically, the GPU-enabled SUNDIALS modules provide functions to copy data from the host to the device and vice versa, as well as support for unmanaged memory or UVM.

In practical terms, the way SUNDIALS handles distinct host and device memory spaces means that users must ensure that user-supplied functions, e.g., the right-hand side function, operate only on simulation data in the device memory space; otherwise, extra memory transfers will be required and performance will suffer. The exception to this rule is if some form of hybrid data partitioning (achievable with the NVECTOR_MANYVECTOR module; see §8.22) is utilized.
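For illustration, below is a minimal sketch of this coherency pattern using the NVECTOR_CUDA module and its host-device transfer functions. The vector length, the initial values, and the v6-style SUNContext_Create call are illustrative assumptions, not prescribed usage:

    #include <nvector/nvector_cuda.h>
    #include <sundials/sundials_context.h>

    int main(void)
    {
      SUNContext sunctx;
      SUNContext_Create(NULL, &sunctx);      /* v6-style call; v7 takes SUN_COMM_NULL */

      sunindextype N = 1000;                 /* illustrative problem size */
      N_Vector y = N_VNew_Cuda(N, sunctx);   /* allocates host and device buffers */

      /* Set the initial condition on the host... */
      sunrealtype *ydata = N_VGetHostArrayPointer_Cuda(y);
      for (sunindextype i = 0; i < N; i++) { ydata[i] = 1.0; }

      /* ...then make the device copy current before handing y to SUNDIALS */
      N_VCopyToDevice_Cuda(y);

      /* ...hand y to an integrator here; on return, the device data is current... */

      /* Copy back to the host only when the host needs to read the results */
      N_VCopyFromDevice_Cuda(y);

      N_VDestroy(y);
      SUNContext_Free(&sunctx);
      return 0;
    }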

SUNDIALS provides many native shared features and modules that are GPU-enabled. Currently, these target the NVIDIA CUDA platform [5], AMD ROCm/HIP [2], and Intel oneAPI [3]. Tables 1.4 through 1.7 summarize the shared SUNDIALS modules that are GPU-enabled, which GPU programming environments they support, and which classes of memory they support (unmanaged or UVM). Users may also supply their own GPU-enabled N_Vector, SUNMatrix, SUNLinearSolver, or SUNNonlinearSolver implementation, and its capabilities will be leveraged since SUNDIALS operates on data through these APIs.

In addition, SUNDIALS provides a memory management helper module (see §13) to support applications that implement their own memory management or memory pooling.
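For example, a minimal sketch (assuming a SUNContext sunctx and vector length N from surrounding setup code) of creating a CUDA memory helper and a vector that allocates through it:

    #include <nvector/nvector_cuda.h>
    #include <sunmemory/sunmemory_cuda.h>

    /* sunctx and N are assumed to exist from earlier setup code */
    SUNMemoryHelper helper = SUNMemoryHelper_Cuda(sunctx);

    /* Create a CUDA vector whose buffers are allocated through the helper;
       SUNFALSE requests separate (unmanaged) host and device memory
       rather than UVM */
    N_Vector y = N_VNewWithMemHelp_Cuda(N, SUNFALSE, helper, sunctx);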

Table 1.4 List of SUNDIALS GPU-enabled N_Vector Modules

| Module            | CUDA | ROCm/HIP | oneAPI | Unmanaged Memory | UVM |
|-------------------|------|----------|--------|------------------|-----|
| NVECTOR_CUDA      | X    |          |        | X                | X   |
| NVECTOR_HIP       | X    | X        |        | X                | X   |
| NVECTOR_SYCL      | X³   | X³       | X      | X                | X   |
| NVECTOR_RAJA      | X    | X        | X      | X                | X   |
| NVECTOR_KOKKOS    | X    | X        | X      | X                | X   |
| NVECTOR_OPENMPDEV | X    | X²       | X²     | X                |     |

Table 1.5 List of SUNDIALS GPU-enabled SUNMatrix Modules

| Module                | CUDA | ROCm/HIP | oneAPI | Unmanaged Memory | UVM |
|-----------------------|------|----------|--------|------------------|-----|
| SUNMATRIX_CUSPARSE    | X    |          |        | X                | X   |
| SUNMATRIX_ONEMKLDENSE | X³   | X³       | X      | X                | X   |
| SUNMATRIX_MAGMADENSE  | X    | X        |        | X                | X   |
| SUNMATRIX_GINKGO      | X    | X        | X      | X                |     |
| SUNMATRIX_KOKKOSDENSE | X    | X        | X      | X                |     |

Table 1.6 List of SUNDIALS GPU-enabled SUNLinearSolver Modules

| Module                | CUDA | ROCm/HIP | oneAPI | Unmanaged Memory | UVM |
|-----------------------|------|----------|--------|------------------|-----|
| SUNLINSOL_CUSOLVERSP  | X    |          |        | X                | X   |
| SUNLINSOL_ONEMKLDENSE | X³   | X³       | X      | X                | X   |
| SUNLINSOL_MAGMADENSE  | X    | X        |        | X                |     |
| SUNLINSOL_GINKGO      | X    | X        | X      | X                |     |
| SUNLINSOL_KOKKOSDENSE | X    | X        | X      | X                |     |
| SUNLINSOL_SPGMR       | X¹   | X¹       | X¹     | X¹               | X¹  |
| SUNLINSOL_SPFGMR      | X¹   | X¹       | X¹     | X¹               | X¹  |
| SUNLINSOL_SPTFQMR     | X¹   | X¹       | X¹     | X¹               | X¹  |
| SUNLINSOL_SPBCGS      | X¹   | X¹       | X¹     | X¹               | X¹  |
| SUNLINSOL_PCG         | X¹   | X¹       | X¹     | X¹               | X¹  |

Table 1.7 List of SUNDIALS GPU-enabled SUNNonlinearSolver Modules

| Module                  | CUDA | ROCm/HIP | oneAPI | Unmanaged Memory | UVM |
|-------------------------|------|----------|--------|------------------|-----|
| SUNNONLINSOL_NEWTON     | X¹   | X¹       | X¹     | X¹               | X¹  |
| SUNNONLINSOL_FIXEDPOINT | X¹   | X¹       | X¹     | X¹               | X¹  |

Notes regarding the above tables:

  1. This module inherits support from the NVECTOR module used.

  2. Support for ROCm/HIP and oneAPI is currently untested.

  3. Support for CUDA and ROCm/HIP is currently untested.

In addition, note that implicit UVM (i.e., malloc returning UVM) is not accounted for in the above tables.

1.10.2. Steps for Using GPU Accelerated SUNDIALS

For any SUNDIALS package, the general steps a user needs to take to use GPU-accelerated SUNDIALS are:

  1. Utilize a GPU-enabled N_Vector implementation. Initial data can be loaded on the host, but must be in the device memory space prior to handing control to SUNDIALS.

  2. Utilize a GPU-enabled SUNLinearSolver implementation (if applicable).

  3. Utilize a GPU-enabled SUNMatrix implementation (if using a matrix-based linear solver).

  4. Utilize a GPU-enabled SUNNonlinearSolver implementation (if applicable).

  5. Write user-supplied functions so that they use data only in the device memory space (again, unless an atypical data partitioning is used). A few examples of these functions are the right-hand side evaluation function, the Jacobian evaluation function, and the preconditioner evaluation function. In the context of CUDA and the right-hand side function, one way a user might ensure data is accessed on the device is to launch a CUDA kernel, which does all of the computation, from a CPU callback that simply extracts the underlying device data array from the N_Vector object passed to it by SUNDIALS, as sketched after this list.
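As a hedged sketch of step 5, the following shows a hypothetical right-hand side implementing the illustrative ODE f(t, y) = -y entirely on the device; the kernel name rhs_kernel, the callback name f, and the launch parameters are assumptions for this example:

    #include <nvector/nvector_cuda.h>

    /* Device kernel computing ydot = -y (illustrative problem) */
    __global__ void rhs_kernel(sunindextype N, const sunrealtype *y,
                               sunrealtype *ydot)
    {
      sunindextype i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) { ydot[i] = -y[i]; }
    }

    /* Host callback with the right-hand side signature used by CVODE/ARKODE;
       it touches only device data, so no host-device transfers occur */
    static int f(sunrealtype t, N_Vector y, N_Vector ydot, void *user_data)
    {
      sunindextype N = N_VGetLength(y);
      const sunrealtype *ydata = N_VGetDeviceArrayPointer_Cuda(y);
      sunrealtype *ydotdata    = N_VGetDeviceArrayPointer_Cuda(ydot);

      unsigned block = 256;
      unsigned grid  = (N + block - 1) / block;
      rhs_kernel<<<grid, block>>>(N, ydata, ydotdata);
      cudaDeviceSynchronize(); /* ensure the kernel finishes before returning */

      return 0; /* a nonzero return would signal an error to the integrator */
    }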

Users should refer to the above tables for a complete list of GPU-enabled native SUNDIALS modules.
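Finally, putting the steps together, below is a minimal sketch of a CVODE setup that runs on the GPU via NVECTOR_CUDA and the matrix-free SUNLINSOL_SPGMR solver (which inherits its GPU support from the vector, per note 1 above). The tolerances, the output time, and the reuse of y, sunctx, and f from the previous sketches are illustrative assumptions:

    #include <cvode/cvode.h>
    #include <nvector/nvector_cuda.h>
    #include <sunlinsol/sunlinsol_spgmr.h>

    /* y, sunctx, and the device-resident RHS function f are assumed from the
       previous sketches; y must already be current on the device (step 1) */
    void *cvode_mem = CVodeCreate(CV_BDF, sunctx);
    CVodeInit(cvode_mem, f, 0.0, y);
    CVodeSStolerances(cvode_mem, 1.0e-6, 1.0e-10);

    /* Matrix-free GMRES: GPU support is inherited from the CUDA vector */
    SUNLinearSolver LS = SUNLinSol_SPGMR(y, SUN_PREC_NONE, 0, sunctx);
    CVodeSetLinearSolver(cvode_mem, LS, NULL);

    sunrealtype t;
    CVode(cvode_mem, 1.0, y, &t, CV_NORMAL); /* advance the solution to t = 1.0 */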