Skip to content

GPU Architecture

The GPU port follows a conservative rule: preserve LESGO numerics and module ordering, then move repeated timestep work to the GPU. The code uses CUDA Fortran, CUF kernel loops, cuFFT, and GPU-aware MPI.

Core Design Rules

Rule Reason
Keep the original solver equations Makes validation against CPU LESGO meaningful
Keep the z-slab MPI decomposition Limits architectural risk and preserves existing MPI assumptions
Use GPU kernels for timestep loops Removes CPU bottlenecks without changing physics
Use persistent contiguous device buffers for MPI Avoids non-contiguous managed-memory MPI hazards
Keep I/O and one-time parsing on CPU They do not dominate timestep performance
Keep fallback switches only for core risk areas Reduces code clutter while preserving debugging ability

Memory Model

The build uses NVHPC CUDA Fortran and managed/device allocations where appropriate. The most important performance rule is to avoid accidental host touches of large managed arrays inside the timestep. Host reductions and diagnostics can silently trigger migration and create false bottlenecks.

Prefer full-domain kernels for regular loops:

!$cuf kernel do(3) <<<*,*>>>
do k = kstart, kend
  do j = 1, n2m
    do i = 1, n1m
      field(i,j,k) = ...
    end do
  end do
end do

Use explicit CUDA Fortran kernels when CUF kernel loops create poor launch geometry, too many small launches, or excessive integer flattening overhead.

MPI And GPU Awareness

The production multi-GPU path assumes GPU-aware MPI:

export MPICH_GPU_SUPPORT_ENABLED=1

The main MPI safety points are:

Communication Area Current Pattern
SGS tau halo Combined contiguous device halo for the 2-rank path
SGS dwdz halo Device-buffer path retained; further MPI variants were not beneficial
Pressure RHS halo Combined contiguous device halo for the 2-rank path
Pressure transpose Specialized nproc==2 transpose-Thomas helper
ATM point-owner LB Experimental targeted device exchange path

Synchronization Policy

Synchronization Type Policy
Before MPI consumes GPU data Required
After diagnostic GPU events Only when timing is enabled
Per-small-kernel strict sync Disabled by default
Global debug sync Available through LESGO_MPI_CUDA_SYNC

When adding a new MPI exchange, pack into a contiguous device buffer, synchronize once before MPI if necessary, exchange, unpack on GPU, and validate with divergence, KE, wall stress, and module-specific checks.