GPU Architecture¶

The GPU port follows a conservative rule: preserve LESGO numerics and module ordering, then move repeated timestep work to the GPU. The code uses CUDA Fortran, CUF kernel loops, cuFFT, and GPU-aware MPI.

Core Design Rules¶

Rule	Reason
Keep the original solver equations	Makes validation against CPU LESGO meaningful
Keep the z-slab MPI decomposition	Limits architectural risk and preserves existing MPI assumptions
Use GPU kernels for timestep loops	Removes CPU bottlenecks without changing physics
Use persistent contiguous device buffers for MPI	Avoids non-contiguous managed-memory MPI hazards
Keep I/O and one-time parsing on CPU	They do not dominate timestep performance
Keep fallback switches only for core risk areas	Reduces code clutter while preserving debugging ability

Memory Model¶

The build uses NVHPC CUDA Fortran and managed/device allocations where appropriate. The most important performance rule is to avoid accidental host touches of large managed arrays inside the timestep. Host reductions and diagnostics can silently trigger migration and create false bottlenecks.

Prefer full-domain kernels for regular loops:

!$cuf kernel do(3) <<<*,*>>>
do k = kstart, kend
  do j = 1, n2m
    do i = 1, n1m
      field(i,j,k) = ...
    end do
  end do
end do

Use explicit CUDA Fortran kernels when CUF kernel loops create poor launch geometry, too many small launches, or excessive integer flattening overhead.

MPI And GPU Awareness¶

The production multi-GPU path assumes GPU-aware MPI:

export MPICH_GPU_SUPPORT_ENABLED=1

The main MPI safety points are:

Communication Area	Current Pattern
SGS tau halo	Combined contiguous device halo for the 2-rank path
SGS dwdz halo	Device-buffer path retained; further MPI variants were not beneficial
Pressure RHS halo	Combined contiguous device halo for the 2-rank path
Pressure transpose	Specialized nproc==2 transpose-Thomas helper
ATM point-owner LB	Experimental targeted device exchange path

Synchronization Policy¶

Synchronization Type	Policy
Before MPI consumes GPU data	Required
After diagnostic GPU events	Only when timing is enabled
Per-small-kernel strict sync	Disabled by default
Global debug sync	Available through `LESGO_MPI_CUDA_SYNC`

When adding a new MPI exchange, pack into a contiguous device buffer, synchronize once before MPI if necessary, exchange, unpack on GPU, and validate with divergence, KE, wall stress, and module-specific checks.