Validation And Performance

This page keeps two separate validation records:

  • the default 128^3 half-channel physics validation requested for turbulence realism;
  • the short 480 x 240 x 240 actuator-turbine benchmark used for CPU/GPU performance comparisons.

Default Half-Channel Validation

This section records the physics validation requested for the standard LESGO half-channel case: no turbines, pressure-gradient forcing, rough-wall lower boundary, and periodic horizontal directions. The purpose is not a short deterministic CPU/GPU bitwise check; it is to confirm that the GPU port still produces a physically turbulent channel-flow solution.

Case Setup

| Item | CPU Run | GPU Run |
| --- | --- | --- |
| Grid | 128 x 128 x 128 | 128 x 128 x 128 |
| Active physics | Default dynamic/Lagrangian SGS | Default dynamic/Lagrangian SGS |
| Turbines | Off | Off |
| Runtime | Restarted at 50,000, continued to 100,000 | Restarted at 50,000, continued to 100,000 |
| Averaging window | 50,000-100,000 | 50,000-100,000 |
| Hardware layout | 32 MPI ranks | 2 MPI ranks / 2 GPUs |

Scalar Checks

| Run | Final Divergence | Final KE | Bottom Wall Stress |
| --- | --- | --- | --- |
| CPU, 32 MPI | 0.2276639E-12 | 0.2311216E+03 | 0.1009177E+01 |
| GPU, 2 MPI / 2 GPU | 0.3354102E-06 | 0.2286064E+03 | 0.9768977E+00 |

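Final KE differs by about 1.1% and bottom wall stress by about 3.2% between the two runs, which is close enough for a turbulent long-run validation. The GPU final divergence (3.4E-07) is about six orders of magnitude larger than the CPU value (2.3E-13) but remains small in absolute terms. The instantaneous trajectories are expected to decorrelate because small floating-point differences grow chaotically in turbulence.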
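The relative differences between the two runs can be computed directly from the Scalar Checks table. The following is a minimal sketch; the helper name `rel_diff` is hypothetical, and taking the CPU value as the reference is an assumption.

```python
# Values copied from the Scalar Checks table above.
cpu = {"divergence": 0.2276639e-12, "ke": 0.2311216e+03, "wall_stress": 0.1009177e+01}
gpu = {"divergence": 0.3354102e-06, "ke": 0.2286064e+03, "wall_stress": 0.9768977e+00}


def rel_diff(reference, value):
    """Relative difference of value against reference (hypothetical helper)."""
    return abs(value - reference) / abs(reference)


# KE and wall stress are O(1) or larger, so a relative comparison is meaningful.
for name in ("ke", "wall_stress"):
    print(f"{name}: {100.0 * rel_diff(cpu[name], gpu[name]):.2f}% relative difference")

# Divergence should be near zero in both runs, so it is reported in absolute terms.
print(f"divergence: CPU {cpu['divergence']:.3e}, GPU {gpu['divergence']:.3e}")
```

Divergence is left out of the relative comparison because a ratio of two near-zero residuals says little about physical agreement.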

Mean Velocity

The mean velocity profile is compared against the rough-wall log-law trend. CPU and GPU are shown on identical semi-log axes, with wall distance z/z0 on the horizontal axis and U+ on the vertical axis.

Figure: CPU and GPU mean velocity profiles for the default 128 cubed half-channel case

| Metric | Value |
| --- | --- |
| Mean velocity L1 difference | 2.43078E-01 |
| Mean velocity relative L2 difference | 1.19896E-02 |
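This page does not state the formulas behind the profile metrics or the log-law reference, so the sketch below fixes one plausible set of definitions. The assumptions are: the L1 difference is the mean absolute difference over heights, the relative L2 difference is normalized by the CPU profile, and the rough-wall log law is U+ = (1/kappa) ln(z/z0) with kappa = 0.4. All function names are hypothetical, and the profiles are synthetic.

```python
import math

KAPPA = 0.4  # von Karman constant; assumed value, not stated on this page


def log_law(z_over_z0):
    """Rough-wall log law: U+ = (1/kappa) * ln(z/z0)."""
    return math.log(z_over_z0) / KAPPA


def l1_difference(ref, other):
    """Mean absolute difference between two profiles (assumed definition)."""
    return sum(abs(a - b) for a, b in zip(ref, other)) / len(ref)


def relative_l2_difference(ref, other):
    """||ref - other||_2 / ||ref||_2 (assumed definition, ref is the CPU profile)."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, other)))
    den = math.sqrt(sum(a ** 2 for a in ref))
    return num / den


# Synthetic example: a log-law "CPU" profile and a "GPU" profile scaled by 1%.
heights = [10.0 * 2 ** k for k in range(8)]   # z/z0 samples, illustrative only
u_cpu = [log_law(z) for z in heights]
u_gpu = [1.01 * u for u in u_cpu]

print(f"L1 difference:          {l1_difference(u_cpu, u_gpu):.5e}")
print(f"relative L2 difference: {relative_l2_difference(u_cpu, u_gpu):.5e}")
```

With a uniform 1% scaling the relative L2 difference comes out at 0.01, which is a convenient sanity check before applying the functions to real profiles.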

Reynolds Stresses

Second-order statistics converge more slowly than the mean profile. The CPU and GPU profiles are compared side by side over the 50,000-100,000 window; the component shapes and near-wall behavior remain consistent, while the remaining differences should be treated as finite-time turbulent sampling error rather than pointwise trajectory error.

Figure: CPU and GPU Reynolds-stress profiles for the default 128 cubed half-channel case

| Metric | Value |
| --- | --- |
| u'u' L1 difference | 1.27736E-01 |
| u'u' relative L2 difference | 6.49681E-02 |
| -u'w' L1 difference | 4.48119E-02 |
| -u'w' relative L2 difference | 1.16428E-01 |
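For reference, the resolved second-order statistics at one height can be sketched as a covariance over samples. This is an illustration only: it assumes the stresses are formed as <a'b'> = <ab> - <a><b>, it ignores any subgrid-scale contribution, and the helper names and sample values are hypothetical rather than taken from LESGO output.

```python
def mean(values):
    """Arithmetic mean of a list of samples."""
    return sum(values) / len(values)


def resolved_stress(a, b):
    """Covariance <a'b'> = <ab> - <a><b> over the supplied samples."""
    return mean([x * y for x, y in zip(a, b)]) - mean(a) * mean(b)


# Synthetic samples of u and w at a single height (illustrative values only).
u = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -0.5, 0.5, -0.5]

uu = resolved_stress(u, u)          # streamwise normal stress u'u'
minus_uw = -resolved_stress(u, w)   # shear stress, reported as -u'w'

print(uu, minus_uw)
```

Because these are averages of products, they need more independent samples than the mean profile to reach the same relative accuracy, which is consistent with the slower convergence noted above.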

Instantaneous Z-Plane And PDF

The mid-plane contours compare the instantaneous u' field at z/H = 0.5 and step 100,000. These are not expected to match pointwise after a long chaotic turbulent integration. The contours are therefore used only to check that both runs show physically turbulent structures, while the normalized PDF compares the instantaneous fluctuation distribution more directly.

Figure: CPU and GPU instantaneous z-plane velocity fluctuation contours for the default 128 cubed half-channel case

| Instantaneous metric | CPU | GPU |
| --- | --- | --- |
| u' RMS on z/H=0.5 plane | 1.33458 | 1.75819 |
| Normalized PDF skewness | -0.119722 | -0.425260 |
| Normalized PDF kurtosis | 2.49917 | 2.70626 |
| Plane file | vel.z-0.50000.100000.c15.bin | vel.z-0.50000.100000.c0.bin |

| PDF Agreement Metric | Value |
| --- | --- |
| L1 distance between normalized PDFs | 1.95679E-01 |
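The plane statistics above can be sketched as follows. The definitions are assumptions, since this page does not give them: kurtosis is taken as the non-excess fourth standardized moment (a Gaussian gives skewness 0 and kurtosis 3), the PDF is a histogram of (u - mean)/rms on fixed bins, and the L1 distance is the integral of the absolute difference between the two histograms. The function names, bin range, and sample values are hypothetical.

```python
import math


def moments(samples):
    """Return (mean, rms, skewness, kurtosis) of the samples about their mean."""
    n = len(samples)
    mu = sum(samples) / n
    fluct = [s - mu for s in samples]
    rms = math.sqrt(sum(f ** 2 for f in fluct) / n)
    skew = sum(f ** 3 for f in fluct) / n / rms ** 3
    kurt = sum(f ** 4 for f in fluct) / n / rms ** 4
    return mu, rms, skew, kurt


def normalized_pdf(samples, lo=-5.0, hi=5.0, nbins=50):
    """Histogram of (u - mean)/rms on fixed bins, scaled to integrate to 1."""
    mu, rms, _, _ = moments(samples)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for s in samples:
        k = math.floor(((s - mu) / rms - lo) / width)
        if 0 <= k < nbins:          # samples outside [lo, hi) are dropped
            counts[k] += 1
    total = sum(counts)
    return [c / (total * width) for c in counts], width


def pdf_l1_distance(p, q, width):
    """Integral of |p - q| over the shared bins; ranges from 0 to 2."""
    return sum(abs(a - b) for a, b in zip(p, q)) * width


# Synthetic samples standing in for the two planes (illustrative only).
plane_a = [-1.0, 1.0, -1.0, 1.0]
plane_b = [-2.0, 0.0, 0.0, 2.0]

pdf_a, width = normalized_pdf(plane_a)
pdf_b, _ = normalized_pdf(plane_b)
print(moments(plane_a), moments(plane_b))
print(pdf_l1_distance(pdf_a, pdf_b, width))
```

Normalizing by each plane's own RMS removes the amplitude difference between the runs, so the distance measures only the difference in distribution shape.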

Horizontal Energy Spectrum

The spectrum below is computed from the full 3D instantaneous velocity snapshots at step 100,000. Because this is a wall-bounded channel, the transform is applied only in the periodic horizontal directions; the horizontal mean is removed at each height, all three velocity components are included, and the result is averaged over z.

Figure: CPU and GPU horizontal energy spectra for the default 128 cubed half-channel case

| Spectrum Metric | Value |
| --- | --- |
| L1 distance between normalized horizontal spectra | 2.52855E-01 |
| Snapshot files | vel.100000.c*.bin |
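The procedure described above (2D transform in the periodic directions, horizontal mean removed per height, three components summed, average over z) can be sketched with NumPy. This is not the script used to produce the figure. The shell binning by nearest integer wavenumber, the array layout `(nx, ny, nz)`, and the function name `horizontal_spectrum` are assumptions, and the normalization applied before the L1 distance is not specified on this page.

```python
import numpy as np


def horizontal_spectrum(u, v, w):
    """Shell-summed horizontal energy spectrum, averaged over z.

    u, v, w: arrays of shape (nx, ny, nz) on a uniform periodic horizontal grid.
    Returns (k, E), where k is the integer horizontal wavenumber shell.
    """
    nx, ny, _ = u.shape
    kx = np.fft.fftfreq(nx, d=1.0 / nx)            # integer wavenumbers in x
    ky = np.fft.fftfreq(ny, d=1.0 / ny)            # integer wavenumbers in y
    kmag = np.sqrt(kx[:, None] ** 2 + ky[None, :] ** 2)
    shell = np.rint(kmag).astype(int)              # nearest-integer shell index
    nshell = int(shell.max()) + 1

    energy = np.zeros((nx, ny))
    for f in (u, v, w):
        # Remove the horizontal mean at each height before transforming.
        fluct = f - f.mean(axis=(0, 1), keepdims=True)
        fhat = np.fft.fft2(fluct, axes=(0, 1)) / (nx * ny)
        # Average the modal energy over z.
        energy += 0.5 * np.mean(np.abs(fhat) ** 2, axis=2)

    spectrum = np.bincount(shell.ravel(), weights=energy.ravel(), minlength=nshell)
    return np.arange(nshell), spectrum


# Synthetic check: a single cosine mode at kx = 3 in the u component.
nx, ny, nz = 16, 8, 4
x = np.arange(nx)
u = np.cos(2.0 * np.pi * 3 * x / nx)[:, None, None] * np.ones((nx, ny, nz))
zero = np.zeros_like(u)

k, E = horizontal_spectrum(u, zero, zero)
print(k[np.argmax(E)], E.sum())
```

With this scaling the spectrum sums to half the mean-square fluctuation (Parseval), so the single-mode test places all of its energy in shell 3.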

Interpretation

The GPU result passes the current physical validation gate for this stage: the mean profile follows the expected log-law trend, the Reynolds-stress profiles have the correct structure, and the instantaneous mid-plane field shows developed turbulent streaks and patches rather than laminar behavior.

480x240x240 ATM Benchmark

This is the short no-I/O verification case for the actuator turbine model at 480 x 240 x 240. The CPU and GPU runs use the same case setup, and only compute time is reported.

Test Setup

| Item | Setting |
| --- | --- |
| Case | test-cases/actuator_turbine_model |
| Grid | Nx=480, Ny=240, Nz=240 |
| Active module | USE_ATM=ON |
| Output policy | Heavy domain/plane output disabled for timing runs |
| CPU sweep | 24, 40, 60, 80, 120 MPI ranks; 3 steps |
| GPU timing | A100 runs, average of steps 2-10 |
| GPU configurations | 1 GPU / 1 MPI and 2 GPUs / 2 MPI measured on same-node A100 runs |

Runtime Summary

| Run | Step Time | Speedup vs Best CPU | Notes |
| --- | --- | --- | --- |
| Best CPU | 0.634 s/step | 1.0x | 120 MPI ranks |
| 1 GPU / 1 MPI | 0.103 s/step | 6.1x | A100, optimized default path |
| 2 GPUs / 2 MPI | 0.061 s/step | 10.4x | Same-node A100 run |

Figure: 480x240x240 step time over iterations

Figure: GPU scaling chart for the 480 workload
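The timing arithmetic behind the summary can be sketched as below. The helper names are hypothetical, the per-step list is synthetic, and excluding step 1 is presumably meant to drop first-step startup cost, although this page only states the window (steps 2-10).

```python
def steady_state_mean(step_times, first=2, last=10):
    """Average step time over steps first..last (1-based, inclusive)."""
    window = step_times[first - 1:last]
    return sum(window) / len(window)


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


# Synthetic per-step timings: a slow first step followed by nine steady steps.
example_steps = [0.50] + [0.10] * 9
print(f"steady-state mean: {steady_state_mean(example_steps):.3f} s/step")

# Step times from the Runtime Summary table (seconds per step).
best_cpu, gpu1, gpu2 = 0.634, 0.103, 0.061
print(f"1 GPU speedup:  {speedup(best_cpu, gpu1):.2f}x")
print(f"2 GPU speedup:  {speedup(best_cpu, gpu2):.2f}x")
print(f"2-GPU parallel efficiency: {gpu1 / (2 * gpu2):.2f}")
```

The rounded table entries give roughly 6.2x for one GPU, slightly above the tabulated 6.1x; the table was presumably computed from unrounded timings.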

CPU Sweep

The CPU baseline is the fastest configuration in this short rank sweep: 120 MPI ranks at 0.634 s/step.

Figure: CPU line sweep for the 480 workload

Module Breakdown

Figure: CPU and GPU module timing breakdown for the 480 workload

Flow-Field Verification

The figure compares the z=2.5 velocity plane at step 10. The left and center panels show the CPU and GPU u field; the right panel shows the absolute difference.

Figure: CPU and GPU flow-field comparison on the z=2.5 plane

| Component | L1 Mean Error | L2 Error | Max Error |
| --- | --- | --- | --- |
| u | 4.97E-16 | 6.60E-16 | 3.11E-15 |
| v | 1.57E-16 | 2.06E-16 | 9.98E-16 |
| w | 7.75E-17 | 1.02E-16 | 5.06E-16 |
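One way to read these errors is against double-precision machine epsilon. The sketch below assumes the solver runs in IEEE double precision, which this page does not state, and simply rescales the tabulated values.

```python
import sys

EPS = sys.float_info.epsilon  # about 2.22e-16 for IEEE double precision

# (L1 mean error, L2 error, max error) copied from the table above.
errors = {
    "u": (4.97e-16, 6.60e-16, 3.11e-15),
    "v": (1.57e-16, 2.06e-16, 9.98e-16),
    "w": (7.75e-17, 1.02e-16, 5.06e-16),
}

for comp, (l1, l2, mx) in errors.items():
    print(f"{comp}: L1 = {l1 / EPS:.2f} eps, L2 = {l2 / EPS:.2f} eps, max = {mx / EPS:.2f} eps")
```

Under that assumption, the largest error is roughly 14 epsilon, so the step-10 fields agree to within accumulated rounding.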

Scalar Checks

| Run | Divergence | KE | Bottom Wall Stress |
| --- | --- | --- | --- |
| CPU, 2 MPI, step 10 | 0.2681714E-03 | 0.4998491E+00 | 0.8686115E-05 |
| 1 GPU / 1 MPI, step 10 | 0.2681679E-03 | 0.4998491E+00 | 0.8686115E-05 |
| 2 GPUs / 2 MPI, step 10 | 0.2681714E-03 | 0.4998491E+00 | 0.8686115E-05 |

Reproduce

cd /glade/u/home/wchen/lesgo-gpu-test/test-cases/actuator_turbine_model
qsub job_compare_cpu120.pbs
qsub job_compare_gpu1_noio.pbs
qsub job_compare_gpu2_noio.pbs