Main Timestep Flow
The current GPU branch keeps the original LESGO timestep structure. The port changes where work is executed and how data is exchanged, not the physical model ordering.
Runtime Sections
The main loop reports the following cumulative module sections. The timings below are from the latest post-cleanup short validation, which used the actuator turbine model (ATM) case on Derecho.
| Section | Main Purpose | Primary Files | 1 MPI / 1 GPU Step-10 Time | 2 MPI / 2 GPU Step-10 Time | GPU Porting Status |
|---|---|---|---|---|---|
| Forcing | ATM, inflow/fringe, applied forcing | forcing.f90, atm_lesgo_interface.f90, actuator_turbine_model.f90 | 0.001192 s | 0.001106 s | GPU-enabled; point-owner LB remains experimental |
| Derivatives | Spatial derivatives, filtering support | derivatives.f90, test_filtermodule.f90 | 0.017776 s | 0.009083 s | GPU-enabled |
| SGS & Stresses | Dynamic SGS, stress tensor, divstress | sgs_stag_util.f90, divstress_uv.f90, divstress_w.f90 | 0.022613 s | 0.012779 s | GPU-enabled with optimized 2-rank halo |
| Convection | Nonlinear convective terms | convec.f90 and reference variants | 0.039461 s | 0.020110 s | GPU-enabled |
| Pressure Solver | RHS, cuFFT, transpose-Thomas, inverse | press_stag_array.f90, tridag_array.f90 | 0.010828 s | 0.012686 s | GPU-enabled; multi-GPU pressure remains communication-sensitive |
| Projection | Velocity projection and pressure gradient correction | forcing.f90, pressure/projection helpers | 0.001794 s | 0.001102 s | GPU-enabled |
| Other | Timers, reductions, residual overhead | main.f90, diagnostics, MPI bookkeeping | 0.011531 s | 0.006615 s | Mostly CPU bookkeeping plus unavoidable synchronization |
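The section times in the table can be cross-checked with a little arithmetic. The sketch below copies the step-10 values from the table, sums each column, and computes a per-section ratio of 1-GPU time to 2-GPU time. The names `SECTION_TIMES` and `summarize` are illustrative only and do not come from the LESGO source.

```python
# Step-10 section times in seconds, copied from the table above:
# (1 MPI / 1 GPU, 2 MPI / 2 GPU).
SECTION_TIMES = {
    "Forcing":         (0.001192, 0.001106),
    "Derivatives":     (0.017776, 0.009083),
    "SGS & Stresses":  (0.022613, 0.012779),
    "Convection":      (0.039461, 0.020110),
    "Pressure Solver": (0.010828, 0.012686),
    "Projection":      (0.001794, 0.001102),
    "Other":           (0.011531, 0.006615),
}


def summarize(times):
    """Return (per-section speedup, 1-GPU total, 2-GPU total)."""
    speedup = {name: one / two for name, (one, two) in times.items()}
    total_one = sum(one for one, _ in times.values())
    total_two = sum(two for _, two in times.values())
    return speedup, total_one, total_two


speedup, total_one, total_two = summarize(SECTION_TIMES)
for name, ratio in speedup.items():
    print(f"{name:16s} {ratio:5.2f}x")
print(f"total: {total_one:.6f} s -> {total_two:.6f} s "
      f"({total_one / total_two:.2f}x)")
```

The column totals come to about 0.105195 s and 0.063481 s, which agree with the step times reported under Numerical Release Gates. The pressure solver is the only section whose 2-GPU time is higher than its 1-GPU time, consistent with the note that multi-GPU pressure remains communication-sensitive.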
Timestep Order
initialize state and device mirrors
for each timestep:
    update forcing / ATM state
    compute derivatives
    build SGS model and stresses
    compute convection
    solve pressure Poisson equation
    project velocity field
    update diagnostics and output counters
finalize
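The ordering above can be sketched as a driver loop that wraps each phase in a timer and accumulates the elapsed time per section. This is a minimal illustration under stated assumptions, not the LESGO main loop: the phase names, the `run` function, and the empty placeholder phases are all hypothetical.

```python
import time
from collections import defaultdict

# Phase names in the order listed above. The phase bodies are
# placeholders (hypothetical), not LESGO routines.
PHASES = [
    "forcing",
    "derivatives",
    "sgs_stresses",
    "convection",
    "pressure",
    "projection",
]


def run(nsteps, phase_impl=None):
    """Run nsteps of the fixed phase sequence.

    Returns (call_order, timers), where timers maps each section name
    to its cumulative elapsed seconds.
    """
    phase_impl = phase_impl or {}
    timers = defaultdict(float)
    call_order = []
    for step in range(1, nsteps + 1):
        step_start = time.perf_counter()
        timed = 0.0
        for name in PHASES:
            t0 = time.perf_counter()
            phase_impl.get(name, lambda: None)()  # no-op if not supplied
            dt = time.perf_counter() - t0
            timers[name] += dt
            timed += dt
            call_order.append((step, name))
        # Time not attributed to a named phase is booked as "other".
        residual = (time.perf_counter() - step_start) - timed
        timers["other"] += max(0.0, residual)
    return call_order, dict(timers)


call_order, timers = run(2)
```

Keeping the phase list in one place makes the ordering explicit, so a port can swap the implementation of a phase without changing the sequence.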
The GPU port preserves this ordering. Optimizations focused on reducing data migration, fusing or enlarging underfilled kernels, using CUDA-aware MPI with contiguous buffers, and avoiding unnecessary cudaDeviceSynchronize() calls.
Data Ownership
The main solver still uses the original one-dimensional z-slab decomposition. Each MPI rank owns a slab in z, including halo regions required by derivatives, SGS, pressure RHS, and projection. Multi-GPU work maps one local MPI rank to one GPU.
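A one-dimensional z-slab decomposition with one rank per GPU can be sketched as follows. This is an illustration only: it assumes the number of z-levels divides evenly among ranks, that z is not periodic, and that each node-local rank index is used directly as the GPU index. The function names are hypothetical and are not taken from the LESGO source.

```python
def z_slab(nz, nproc, rank):
    """Return the half-open range [k_start, k_end) of z-levels owned by rank.

    Assumes nz is evenly divisible by nproc (an assumption of this sketch).
    """
    if nz % nproc != 0:
        raise ValueError("nz must be divisible by nproc in this sketch")
    nz_local = nz // nproc
    return rank * nz_local, (rank + 1) * nz_local


def halo_neighbors(nproc, rank):
    """Ranks below and above in z; None at the bottom and top boundaries."""
    down = rank - 1 if rank > 0 else None
    up = rank + 1 if rank < nproc - 1 else None
    return down, up


def device_for(local_rank, gpus_per_node):
    """Map a node-local MPI rank to a GPU index, one rank per GPU."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("each local rank needs its own GPU")
    return local_rank


# Example: 64 z-levels split over 2 ranks.
for rank in range(2):
    print(rank, z_slab(64, 2, rank), halo_neighbors(2, rank))
```

With two ranks, each rank has exactly one halo neighbor, which is the configuration the optimized 2-rank halo path in the SGS section targets.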
Pressure is the main exception internally: the optimized pressure solver uses a pressure-only transpose-Thomas helper so full vertical lines can be solved efficiently while preserving the outer z-slab decomposition.
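The per-line solve that the transposed layout enables can be illustrated with the standard Thomas algorithm for a tridiagonal system. The sketch below is plain serial Python for a single vertical line; it does not show the transpose, the GPU batching, or the actual coefficients used in press_stag_array.f90 and tridag_array.f90, and the function name `thomas_solve` is hypothetical.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    No pivoting is done, so this suits diagonally dominant systems.
    """
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination down the line.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution up the line.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


# Small check: a (-1, 2, -1) system whose exact solution is [1, 2, 3, 4].
x = thomas_solve([0.0, -1.0, -1.0, -1.0],
                 [2.0, 2.0, 2.0, 2.0],
                 [-1.0, -1.0, -1.0, 0.0],
                 [0.0, 0.0, 0.0, 5.0])
```

Both sweeps carry a dependency from one z-level to the next, which is why the solve needs a full vertical line available in one place rather than split across slabs.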
Numerical Release Gates
| Diagnostic | Purpose |
|---|---|
| Velocity divergence metric | Projection/pressure correctness |
| Kinetic energy | Global flow consistency |
| Bottom wall stress | Boundary and SGS/wall model consistency |
| Force sums and turbine quantities | ATM correctness when forcing paths change |
| Module timing | Regression detection |

| Case | Divergence | KE | Bottom Wall Stress | Step Time |
|---|---|---|---|---|
| 1 MPI / 1 GPU | 0.2681679E-03 | 0.4998491E+00 | 0.8686115E-05 | 0.1051949 s |
| 2 MPI / 2 GPU | 0.2681714E-03 | 0.4998491E+00 | 0.8686115E-05 | 0.0634820 s |
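One way to turn these diagnostics into an automated gate is to compare a candidate run against a reference run with a relative tolerance per quantity. The sketch below uses the values from the table above. The tolerances in `REL_TOL` are hypothetical placeholders chosen for illustration and are not thresholds defined by the project, and the `gate` function is likewise an assumption.

```python
# Diagnostics copied from the table above.
ONE_GPU = {
    "divergence": 0.2681679e-03,
    "ke": 0.4998491e+00,
    "bottom_wall_stress": 0.8686115e-05,
}
TWO_GPU = {
    "divergence": 0.2681714e-03,
    "ke": 0.4998491e+00,
    "bottom_wall_stress": 0.8686115e-05,
}

# Hypothetical relative tolerances, not project-defined thresholds.
REL_TOL = {
    "divergence": 1e-4,
    "ke": 1e-6,
    "bottom_wall_stress": 1e-6,
}


def gate(reference, candidate, rel_tol):
    """Return {name: relative difference} for each diagnostic that
    exceeds its tolerance. An empty dict means the gate passes."""
    failures = {}
    for name, ref in reference.items():
        rel = abs(candidate[name] - ref) / abs(ref)
        if rel > rel_tol[name]:
            failures[name] = rel
    return failures


failures = gate(ONE_GPU, TWO_GPU, REL_TOL)
print("PASS" if not failures else f"FAIL: {failures}")
```

In the table, kinetic energy and bottom wall stress match to the printed precision, while the divergence metric differs by roughly one part in 10^5 between the two configurations, so any real threshold for that quantity has to allow for differences of that size.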