Main Timestep Flow

The current GPU branch keeps the original LESGO timestep structure. The port changes where work is executed and how data is exchanged, not the physical model ordering.

Runtime Sections

The main loop reports cumulative timings for the module sections listed below. The timings shown come from the latest post-cleanup short validation run, which used the actuator turbine model (ATM) case on Derecho.

| Section | Main Purpose | Primary Files | 1 MPI / 1 GPU Step-10 Time (s) | 2 MPI / 2 GPU Step-10 Time (s) | GPU Porting Status |
| --- | --- | --- | --- | --- | --- |
| Forcing | ATM, inflow/fringe, applied forcing | forcing.f90, atm_lesgo_interface.f90, actuator_turbine_model.f90 | 0.001192 | 0.001106 | GPU-enabled; point-owner LB remains experimental |
| Derivatives | Spatial derivatives, filtering support | derivatives.f90, test_filtermodule.f90 | 0.017776 | 0.009083 | GPU-enabled |
| SGS & Stresses | Dynamic SGS, stress tensor, divstress | sgs_stag_util.f90, divstress_uv.f90, divstress_w.f90 | 0.022613 | 0.012779 | GPU-enabled with optimized 2-rank halo |
| Convection | Nonlinear convective terms | convec.f90 and reference variants | 0.039461 | 0.020110 | GPU-enabled |
| Pressure Solver | RHS, cuFFT, transpose-Thomas, inverse | press_stag_array.f90, tridag_array.f90 | 0.010828 | 0.012686 | GPU-enabled; multi-GPU pressure remains communication-sensitive |
| Projection | Velocity projection and pressure gradient correction | forcing.f90, pressure/projection helpers | 0.001794 | 0.001102 | GPU-enabled |
| Other | Timers, reductions, residual overhead | main.f90, diagnostics, MPI bookkeeping | 0.011531 | 0.006615 | Mostly CPU bookkeeping plus unavoidable synchronization |
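
The per-section scaling implied by these timings can be worked out directly. The following is a minimal sketch in Python; the timing values are copied from the table, while the dictionary and function names are illustrative and not part of LESGO.

```python
# Section timings in seconds, copied from the table above:
# (1 MPI / 1 GPU, 2 MPI / 2 GPU). Names below are illustrative only.
SECTION_TIMES = {
    "Forcing": (0.001192, 0.001106),
    "Derivatives": (0.017776, 0.009083),
    "SGS & Stresses": (0.022613, 0.012779),
    "Convection": (0.039461, 0.020110),
    "Pressure Solver": (0.010828, 0.012686),
    "Projection": (0.001794, 0.001102),
    "Other": (0.011531, 0.006615),
}


def section_speedups(times):
    """Return {section: t_1gpu / t_2gpu}; a value below 1 means the section got slower."""
    return {name: t1 / t2 for name, (t1, t2) in times.items()}


def totals(times):
    """Sum each column so it can be compared with the reported step time."""
    total_1gpu = sum(t1 for t1, _ in times.values())
    total_2gpu = sum(t2 for _, t2 in times.values())
    return total_1gpu, total_2gpu


if __name__ == "__main__":
    for name, speedup in section_speedups(SECTION_TIMES).items():
        print(f"{name:16s} {speedup:5.2f}x")
    t1, t2 = totals(SECTION_TIMES)
    print(f"sum of sections: {t1:.6f} s (1 GPU), {t2:.6f} s (2 GPU)")
```

With these numbers, the pressure solver is the only section that is slower on two GPUs (0.010828 s versus 0.012686 s), which is consistent with its communication-sensitive status. The column sums, 0.105195 s and 0.063481 s, agree with the step times reported under Numerical Release Gates.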

Timestep Order

initialize state and device mirrors
for each timestep:
  update forcing / ATM state
  compute derivatives
  build SGS model and stresses
  compute convection
  solve pressure Poisson equation
  project velocity field
  update diagnostics and output counters
finalize

The GPU port preserves this ordering. Optimizations focused on reducing data migration, fusing or enlarging underfilled kernels, using CUDA-aware MPI with contiguous buffers, and avoiding unnecessary cudaDeviceSynchronize() calls.
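
The ordering above, combined with cumulative per-section timers, can be sketched as follows. This is a host-side Python illustration only: the `SectionTimers` class, `run_timesteps`, and the stage callables are hypothetical names, and the sketch does not model asynchronous GPU kernel launches, which a real timer would have to account for at section boundaries.

```python
import time
from collections import OrderedDict

# Section names follow the Runtime Sections table; everything else is hypothetical.
SECTIONS = [
    "Forcing", "Derivatives", "SGS & Stresses", "Convection",
    "Pressure Solver", "Projection", "Other",
]


class SectionTimers:
    """Accumulate wall-clock time per named section across all timesteps."""

    def __init__(self, names):
        self.total = OrderedDict((name, 0.0) for name in names)

    def run(self, name, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.total[name] += time.perf_counter() - start
        return result


def run_timesteps(nsteps, stages, timers):
    """Run every stage in a fixed order each step; return the executed order."""
    executed = []
    for step in range(1, nsteps + 1):
        for name, fn in stages:
            timers.run(name, fn, step)
            executed.append(name)
    return executed


if __name__ == "__main__":
    # Placeholder stages that do nothing; a real solver would call its kernels here.
    stages = [(name, lambda step: None) for name in SECTIONS]
    timers = SectionTimers(SECTIONS)
    run_timesteps(10, stages, timers)
    for name, seconds in timers.total.items():
        print(f"{name:16s} {seconds:.6f} s")
```

Keeping the stage list as ordered data makes it easy to confirm that a refactor has not reordered the physical model steps.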

Data Ownership

The main solver still uses the original one-dimensional z-slab decomposition. Each MPI rank owns a slab in z, including halo regions required by derivatives, SGS, pressure RHS, and projection. Multi-GPU work maps one local MPI rank to one GPU.
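
As a rough illustration of this layout, the sketch below splits a global z range into contiguous slabs with halos and maps a node-local rank to a device index. The function names, the halo width of one level, the remainder handling, and the clipping of halos at the domain boundaries are all assumptions made for the example; LESGO's actual index conventions may differ.

```python
def z_slab(nz_global, nranks, rank, halo=1):
    """Return (own_start, own_end, halo_start, halo_end) as end-exclusive global z indices.

    Assumptions for this sketch: levels are split as evenly as possible,
    low ranks take any remainder, and halos are clipped at the domain
    boundaries (z is treated as non-periodic).
    """
    base, extra = divmod(nz_global, nranks)
    own_start = rank * base + min(rank, extra)
    own_end = own_start + base + (1 if rank < extra else 0)
    halo_start = max(0, own_start - halo)
    halo_end = min(nz_global, own_end + halo)
    return own_start, own_end, halo_start, halo_end


def device_for_rank(local_rank, gpus_per_node):
    """Map a node-local MPI rank to a GPU index, one rank per GPU."""
    return local_rank % gpus_per_node


if __name__ == "__main__":
    for rank in range(2):
        print(rank, z_slab(64, 2, rank), device_for_rank(rank, 2))
```

For 64 levels on two ranks, rank 0 owns levels 0 to 31 and reads one halo level from rank 1, and rank 1 owns levels 32 to 63 and reads one halo level from rank 0.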

Pressure is the main exception internally: the optimized pressure solver uses a pressure-only transpose-Thomas helper so full vertical lines can be solved efficiently while preserving the outer z-slab decomposition.
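
For readers unfamiliar with the Thomas algorithm, the sketch below solves a tridiagonal system along one full vertical line and then applies the same solve to several lines. It is a generic textbook version in Python, not the code in tridag_array.f90: the function names are hypothetical, the transpose that gathers full lines onto one rank is not shown, and every line shares the same coefficients here, which is a simplification.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system for one vertical line.

    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    No pivoting is done, so the matrix should be diagonally dominant.
    """
    n = len(d)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = c[0] / b[0]
    d_prime[0] = d[0] / b[0]
    # Forward elimination.
    for i in range(1, n):
        denom = b[i] - a[i] * c_prime[i - 1]
        c_prime[i] = c[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (d[i] - a[i] * d_prime[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def solve_lines(a, b, c, rhs_lines):
    """Apply the same tridiagonal solve to each vertical line independently."""
    return [thomas_solve(a, b, c, d) for d in rhs_lines]


if __name__ == "__main__":
    n = 4
    a = [0.0] + [-1.0] * (n - 1)
    b = [2.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    print(thomas_solve(a, b, c, [0.0, 0.0, 0.0, 5.0]))
```

The forward sweep depends on the previous level at each step, so the recurrence needs the whole line in one place. That is why a solver of this kind transposes data internally instead of working on z-slabs directly.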

Numerical Release Gates

| Diagnostic | Purpose |
| --- | --- |
| Velocity divergence metric | Projection/pressure correctness |
| Kinetic energy | Global flow consistency |
| Bottom wall stress | Boundary and SGS/wall model consistency |
| Force sums and turbine quantities | ATM correctness when forcing paths change |
| Module timing | Regression detection |

| Case | Divergence | KE | Bottom Wall Stress | Step Time (s) |
| --- | --- | --- | --- | --- |
| 1 MPI / 1 GPU | 0.2681679E-03 | 0.4998491E+00 | 0.8686115E-05 | 0.1051949 |
| 2 MPI / 2 GPU | 0.2681714E-03 | 0.4998491E+00 | 0.8686115E-05 | 0.0634820 |
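
A gate check of this kind can be automated by comparing the diagnostics of two runs against relative tolerances. The sketch below uses the values reported above; the tolerance values, dictionary keys, and function names are hypothetical and chosen only for illustration, not taken from the project's release criteria.

```python
def rel_diff(a, b):
    """Symmetric relative difference; 0.0 when both values are zero."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale


# Values copied from the case table above (Fortran E-notation parses directly).
ONE_GPU = {
    "divergence": float("0.2681679E-03"),
    "ke": float("0.4998491E+00"),
    "bot_wall_stress": float("0.8686115E-05"),
    "step_time": 0.1051949,
}
TWO_GPU = {
    "divergence": float("0.2681714E-03"),
    "ke": float("0.4998491E+00"),
    "bot_wall_stress": float("0.8686115E-05"),
    "step_time": 0.0634820,
}

# Hypothetical tolerances, for illustration only.
RTOL = {"divergence": 1e-4, "ke": 1e-6, "bot_wall_stress": 1e-6}


def check_gates(reference, candidate, rtol):
    """Return {diagnostic: True/False} for each diagnostic that has a tolerance."""
    return {
        key: rel_diff(reference[key], candidate[key]) <= tol
        for key, tol in rtol.items()
    }


if __name__ == "__main__":
    print(check_gates(ONE_GPU, TWO_GPU, RTOL))
    speedup = ONE_GPU["step_time"] / TWO_GPU["step_time"]
    print(f"step-time speedup: {speedup:.2f}x")
```

In the reported runs, kinetic energy and bottom wall stress agree to all printed digits, the divergence metric differs by a relative amount of about 1.3e-5, and the step time improves by a factor of roughly 1.66 on two GPUs.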