Computational Analysis
This document derives compute, communication, memory, and I/O costs for the main loop in this project and ties them to expected strong/weak scaling behavior and benchmark methodology.
Assumptions
- Global grid: $Nx \times Ny$, uniform spacings $\Delta x$, $\Delta y$.
- MPI 2D Cartesian decomposition: $Px \times Py$ ranks, $p = Px \cdot Py$.
- Per-rank interior tile: $nx \times ny$ where $nx \approx Nx/Px$, $ny \approx Ny/Py$ (last ranks may get remainders).
- Halo width: $h = 1$ (5-point diffusion, 1st-order upwind advection).
- Data type: 8-byte double.
- One field updated per step (`u`), with a ping-pong buffer (`tmp`).
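For concreteness, a minimal sketch of how this decomposition could be set up with MPI's Cartesian topology routines; the function name `tile_sizes` and the mapping of `dims[0]`/`dims[1]` to x/y are illustrative assumptions, not the project's actual code.

```c
#include <mpi.h>

/* Sketch: derive Px x Py and the local tile size nx x ny
 * from the assumptions above. Names are illustrative only. */
void tile_sizes(MPI_Comm *cart, int Nx, int Ny,
                int *nx, int *ny, int coords[2])
{
    int p, rank, dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Dims_create(p, 2, dims);                /* pick Px x Py close to sqrt(p) each */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, cart);

    MPI_Comm_rank(*cart, &rank);
    MPI_Cart_coords(*cart, rank, 2, coords);

    /* Base tile; the last rank in each direction absorbs the remainder. */
    *nx = Nx / dims[0] + (coords[0] == dims[0] - 1 ? Nx % dims[0] : 0);
    *ny = Ny / dims[1] + (coords[1] == dims[1] - 1 ? Ny % dims[1] : 0);
}
```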
Main Loop Pseudocode
```
for n in 0..steps-1:
    exchange_halos(u)
    apply_boundary(u)
    compute_diffusion(u, tmp)
    compute_advection(u, tmp)
    swap(u, tmp)
    if n % out_every == 0:
        write_snapshot(u, n)
        report_stats(u, n)
```
Per-step Complexity (per rank)
1) Halo exchange
Faces exchanged once per step (nonblocking). For $h = 1$:
- Left/right faces: $2 \times (ny \cdot h)$ doubles
- Down/up faces: $2 \times (nx \cdot h)$ doubles
Latency cost: 4 sends (and 4 matching receives) per rank per step (L/R/D/U).
Bandwidth cost: $16\,(nx + ny)$ bytes sent per rank per step ($h = 1$, doubles).
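A minimal nonblocking-exchange sketch under the assumptions above: row-major storage with a one-cell ghost ring, neighbors obtained from `MPI_Cart_shift`, and a strided column type built with `MPI_Type_vector`. Buffer layout and names are assumptions, not the project's actual implementation.

```c
#include <mpi.h>

/* u is (nx+2) x (ny+2), row-major, with a 1-cell ghost ring.
 * IDX(i, j) indexes row i (x-direction), column j (y-direction). */
#define IDX(i, j) ((i) * (ny + 2) + (j))

void exchange_halos(double *u, int nx, int ny, MPI_Comm cart)
{
    int left, right, down, up;
    MPI_Cart_shift(cart, 0, 1, &left, &right);   /* x-neighbors */
    MPI_Cart_shift(cart, 1, 1, &down, &up);      /* y-neighbors */

    /* A column (fixed j) is strided: elements are ny+2 doubles apart. */
    MPI_Datatype col;
    MPI_Type_vector(nx, 1, ny + 2, MPI_DOUBLE, &col);
    MPI_Type_commit(&col);

    MPI_Request req[8];
    /* Rows (contiguous): exchange with left/right in x. */
    MPI_Irecv(&u[IDX(0, 1)],      ny, MPI_DOUBLE, left,  0, cart, &req[0]);
    MPI_Irecv(&u[IDX(nx + 1, 1)], ny, MPI_DOUBLE, right, 1, cart, &req[1]);
    MPI_Isend(&u[IDX(nx, 1)],     ny, MPI_DOUBLE, right, 0, cart, &req[2]);
    MPI_Isend(&u[IDX(1, 1)],      ny, MPI_DOUBLE, left,  1, cart, &req[3]);
    /* Columns (strided): exchange with down/up in y. */
    MPI_Irecv(&u[IDX(1, 0)],      1, col, down, 2, cart, &req[4]);
    MPI_Irecv(&u[IDX(1, ny + 1)], 1, col, up,   3, cart, &req[5]);
    MPI_Isend(&u[IDX(1, ny)],     1, col, up,   2, cart, &req[6]);
    MPI_Isend(&u[IDX(1, 1)],      1, col, down, 3, cart, &req[7]);

    /* MPI_PROC_NULL neighbors (physical boundaries) make these no-ops. */
    MPI_Waitall(8, req, MPI_STATUSES_IGNORE);
    MPI_Type_free(&col);
}
```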
2) Boundary conditions
Applied only at physical (domain) boundaries. Per-rank cost is $O(nx + ny)$ (copy/mirror the ghost rings). Ranks with all four neighbors present pay ~ $O(1)$.
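As an illustration, a zero-gradient (mirror) condition applied only on ranks that own a physical edge could look like the sketch below; the choice of a mirror BC and the `at_*` flags (true where `MPI_Cart_shift` returned `MPI_PROC_NULL`) are assumptions for the sketch.

```c
/* Sketch: mirror (zero-gradient) BC on physical edges only; O(nx + ny) per rank. */
void apply_boundary(double *u, int nx, int ny,
                    int at_left, int at_right, int at_down, int at_up)
{
    #define IDX(i, j) ((i) * (ny + 2) + (j))
    if (at_left)
        for (int j = 0; j <= ny + 1; j++) u[IDX(0, j)]      = u[IDX(1, j)];
    if (at_right)
        for (int j = 0; j <= ny + 1; j++) u[IDX(nx + 1, j)] = u[IDX(nx, j)];
    if (at_down)
        for (int i = 0; i <= nx + 1; i++) u[IDX(i, 0)]      = u[IDX(i, 1)];
    if (at_up)
        for (int i = 0; i <= nx + 1; i++) u[IDX(i, ny + 1)] = u[IDX(i, ny)];
    #undef IDX
}
```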
3) Diffusion (5-point stencil, explicit)
Each interior point touches constant neighbors:
$W_{\mathrm{diff}}(nx, ny) = \Theta(nx \cdot ny)$
Arithmetic intensity (naive): a handful of flops per point against 5 loads + 1 store (well under 1 flop/byte) → memory bound unless tiled.
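A plain explicit (FTCS-style) 5-point update, assuming the same layout as the sketches above; the coefficients and signature are illustrative.

```c
/* Sketch: explicit 5-point diffusion, reading u and writing tmp.
 * Roughly 6 flops per point against 5 loads + 1 store -> memory bound. */
void compute_diffusion(const double *u, double *tmp, int nx, int ny,
                       double D, double dt, double dx, double dy)
{
    #define IDX(i, j) ((i) * (ny + 2) + (j))
    const double cx = D * dt / (dx * dx);
    const double cy = D * dt / (dy * dy);
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++)
            tmp[IDX(i, j)] = u[IDX(i, j)]
                + cx * (u[IDX(i + 1, j)] - 2.0 * u[IDX(i, j)] + u[IDX(i - 1, j)])
                + cy * (u[IDX(i, j + 1)] - 2.0 * u[IDX(i, j)] + u[IDX(i, j - 1)]);
    #undef IDX
}
```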
4) Advection (1st-order upwind)
Constant work per point (two 1D upwind derivatives):
\[W_{\mathrm{adv}}(nx, ny) = \Theta(nx \cdot ny)\]
Similar memory behavior to diffusion (more loads if branching on the velocity sign).
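A first-order upwind sketch for a constant velocity $(v_x, v_y)$, accumulating into `tmp` on top of the diffusion update; whether the project fuses or separates these passes is an assumption here.

```c
/* Sketch: 1st-order upwind advection with constant (vx, vy),
 * added to tmp after the diffusion update. */
void compute_advection(const double *u, double *tmp, int nx, int ny,
                       double vx, double vy, double dt, double dx, double dy)
{
    #define IDX(i, j) ((i) * (ny + 2) + (j))
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++) {
            /* Upwind one-sided differences: side picked by the velocity sign. */
            double dudx = (vx > 0.0)
                ? (u[IDX(i, j)] - u[IDX(i - 1, j)]) / dx
                : (u[IDX(i + 1, j)] - u[IDX(i, j)]) / dx;
            double dudy = (vy > 0.0)
                ? (u[IDX(i, j)] - u[IDX(i, j - 1)]) / dy
                : (u[IDX(i, j + 1)] - u[IDX(i, j)]) / dy;
            tmp[IDX(i, j)] -= dt * (vx * dudx + vy * dudy);
        }
    #undef IDX
}
```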
5) Swap
Pointer swap is $O(1)$. If implemented as copy, it would be $O(nx \cdot ny)$ (we avoid that).
6) Output & stats (when triggered)
- `write_snapshot`: I/O bound; writing the full tile ⇒ $O(nx \cdot ny)$ data per rank (CSV is larger on disk).
- `report_stats`: local reduction $O(nx \cdot ny)$ plus a global `MPI_Allreduce` → $O(\log p)$ latency and a tiny payload.
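The stats path could look like the following sketch: a local $O(nx \cdot ny)$ pass followed by small `MPI_Allreduce` calls. The particular statistics (sum/min/max) and names are assumptions.

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: local O(nx*ny) pass, then tiny global reductions. */
void report_stats(const double *u, int nx, int ny, MPI_Comm cart, int step)
{
    #define IDX(i, j) ((i) * (ny + 2) + (j))
    double loc[3] = {0.0, u[IDX(1, 1)], u[IDX(1, 1)]};   /* sum, min, max */
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++) {
            double v = u[IDX(i, j)];
            loc[0] += v;
            if (v < loc[1]) loc[1] = v;
            if (v > loc[2]) loc[2] = v;
        }
    #undef IDX

    double sum, mn, mx;
    MPI_Allreduce(&loc[0], &sum, 1, MPI_DOUBLE, MPI_SUM, cart);
    MPI_Allreduce(&loc[1], &mn,  1, MPI_DOUBLE, MPI_MIN, cart);
    MPI_Allreduce(&loc[2], &mx,  1, MPI_DOUBLE, MPI_MAX, cart);

    int rank;
    MPI_Comm_rank(cart, &rank);
    if (rank == 0)
        printf("step %d: sum=%g min=%g max=%g\n", step, sum, mn, mx);
}
```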
Global (per-step) Costs
Sum over all ranks:
- Compute: $\Theta(Nx \cdot Ny)$ for diffusion + $\Theta(Nx \cdot Ny)$ for advection → $\Theta(Nx \cdot Ny)$ overall.
- Communication: the total halo volume across ranks scales with the total perimeter of all tiles. For a balanced grid:
  - Each interior edge carries traffic in both directions; asymptotically, the global halo volume per step is \(V_{\mathrm{halo, global}} \approx 16 \left( Px \cdot Ny + Py \cdot Nx \right)\ \text{bytes}\) (one-line derivation below).
- Latency: $4p$ messages per step (but overlappable in practice).
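The global figure follows from the per-rank volume in one line (ignoring the slightly smaller contribution of ranks on the physical boundary):

\[V_{\mathrm{halo,\,global}} \approx \sum_{\text{ranks}} 16\,(nx + ny) = 16\,Px\,Py\left(\frac{Nx}{Px} + \frac{Ny}{Py}\right) = 16\,(Py \cdot Nx + Px \cdot Ny)\ \text{bytes}\]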
Strong vs Weak Scaling Expectations
Strong scaling (fixed $Nx \times Ny$, increase $p$)
- Compute per rank: $\Theta((Nx\cdot Ny)/p)$ → ideal time $\propto 1/p$.
- Comm per rank: $\Theta(nx + ny)$ with $nx \approx Nx/Px$, $ny \approx Ny/Py$. If $Px \approx Py \approx \sqrt{p}$, then \(nx \approx \frac{Nx}{\sqrt{p}},\quad ny \approx \frac{Ny}{\sqrt{p}} \Rightarrow V_{\mathrm{halo}} \propto \frac{Nx + Ny}{\sqrt{p}}\)
- Efficiency drops when comm + latency + sync dominate compute. Expect a knee where local tiles become too small.
- The benchmark scripts also report the Karp-Flatt metric alongside speedup and efficiency (standard definitions below).
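For reference, the standard definitions of speedup, strong-scaling efficiency, and the Karp-Flatt experimentally determined serial fraction, with $T_1$ the single-rank time and $T_p$ the time on $p$ ranks:

\[S_p = \frac{T_1}{T_p},\qquad E_s(p) = \frac{S_p}{p},\qquad e_{\mathrm{KF}}(p) = \frac{1/S_p - 1/p}{1 - 1/p}\]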
Weak scaling (fixed $nx \times ny$ per rank, increase $p$)
- Compute per rank: ~constant → ideal time flat vs $p$.
- Comm per rank: $\Theta(nx + ny)$ → also constant (for fixed tile).
- Degradation is mainly due to global reductions, network congestion, and I/O.
Memory Footprint (per rank)
For two full fields (`u`, `tmp`) including halos:
\[M \approx 2 \cdot 8 \cdot (nx + 2h)\,(ny + 2h)\ \text{bytes} = 16\,(nx + 2)(ny + 2)\ \text{bytes for}\ h = 1\]
Plus temporary lines, MPI buffers, and datatypes (usually negligible compared to the two field arrays).
Putting It Together — Per-step Time Model
Let:
- $\alpha$ = latency (s/message), $\beta$ = inverse bandwidth (s/byte) for effective links.
- $c_d, c_a$ = compute costs per cell (s), for diffusion & advection respectively.
Then a simple overlapping model:
\[T_{\mathrm{step}} \approx \max\!\left[\,(c_d + c_a)\,nx\,ny,\ \ 4\alpha + \beta\,V_{\mathrm{halo}}(nx, ny)\,\right] + T_{\mathrm{reduction}}\ \text{(if stats are reported)}\]
$T_{\mathrm{reduction}}$ for `MPI_Allreduce` is ~ $O(\alpha \log p + \beta \cdot \text{payload})$, with a tiny payload here.
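A tiny helper makes the model concrete; $\alpha$, $\beta$, $c_d$, $c_a$ are meant to be measured or fitted on the target machine, and the function name and signature are illustrative.

```c
#include <math.h>

/* Sketch: evaluate the overlapping per-step model above.
 * alpha (s/msg), beta (s/byte), c_d, c_a (s/cell) must be measured/fitted;
 * no project-specific values are assumed here. */
double model_step_time(int nx, int ny, double alpha, double beta,
                       double c_d, double c_a, int with_stats, int p)
{
    double t_comp = (c_d + c_a) * (double)nx * (double)ny;
    double v_halo = 16.0 * (nx + ny);                  /* bytes per rank, h = 1 */
    double t_comm = 4.0 * alpha + beta * v_halo;
    double t_red  = with_stats ? alpha * ceil(log2((double)p)) : 0.0;
    return fmax(t_comp, t_comm) + t_red;               /* compute/comm overlap */
}
```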
Benchmarks to Run
- Strong scaling: $Nx = Ny = 1024$, `steps=200`, ranks ∈ {1, 2, 4, 8, 16}. Report: total runtime, per-step average, $E_s = T_1 / (p \cdot T_p)$.
- Weak scaling: per-rank $(nx, ny) = (256, 256)$, `steps=200`, ranks ∈ {1, 4, 16}. Report: per-step average and $E_w = T_{p_0}/T_p$ with $p_0 = 1$.
- Sensitivity: vary the halo width $h$ (1 → 2) to see the comm slope; vary the output frequency to observe I/O impact.
Measurement Methodology
- Use `MPI_Wtime()` to bracket:
  - the entire run (after a warm-up barrier) → total runtime;
  - the inside of the loop → per-step time; accumulate average, min/max.
- Use `MPI_Reduce` to collect the worst step across ranks (max) and the memory maximum (`VmRSS` from `/proc/self/status` on Linux).
- For shell-level timing comparisons, use `/usr/bin/time -v mpirun ...` in addition to the in-code timers.
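An in-code timing skeleton consistent with these points; only the `MPI_Wtime`/`MPI_Reduce` pattern is shown, `step_fn` stands in for one solver step, and the `VmRSS` parsing is omitted.

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: time the main loop; step_fn stands in for one solver step. */
void timed_run(int steps, void (*step_fn)(int), MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Barrier(comm);                        /* warm-up/sync before timing */
    double t_total0 = MPI_Wtime();
    double step_max = 0.0, step_sum = 0.0;

    for (int n = 0; n < steps; n++) {
        double t0 = MPI_Wtime();
        step_fn(n);                           /* halo exchange + compute + swap */
        double dt = MPI_Wtime() - t0;
        step_sum += dt;
        if (dt > step_max) step_max = dt;
    }

    double t_total = MPI_Wtime() - t_total0;
    double worst_step, worst_total;
    MPI_Reduce(&step_max, &worst_step,  1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Reduce(&t_total,  &worst_total, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    if (rank == 0)
        printf("total=%.3f s, avg step=%.6f s, worst step=%.6f s\n",
               worst_total, step_sum / steps, worst_step);
}
```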
Sanity Checks
- Determinism: identical outputs across runs with same seeds and decomposition.
- CFL/diffusion limits: automatically compute a safe $\Delta t$ from the chosen $D$, $v_x$, $v_y$, $\Delta x$, $\Delta y$ and assert it (a sketch follows this list).
- Halo correctness: checksum per-rank borders before/after exchange.
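A stable-time-step helper of the kind the CFL/diffusion check refers to. The combination of the explicit-diffusion limit with the upwind CFL limit is standard; the 0.9 safety factor and the function name are assumptions.

```c
#include <assert.h>
#include <math.h>

/* Sketch: explicit-diffusion limit dt <= 1 / (2 D (1/dx^2 + 1/dy^2))
 * combined with the upwind CFL limit dt * (|vx|/dx + |vy|/dy) <= 1,
 * scaled by an assumed 0.9 safety factor. */
double safe_dt(double D, double vx, double vy, double dx, double dy)
{
    double dt_diff = 1.0 / (2.0 * D * (1.0 / (dx * dx) + 1.0 / (dy * dy)));
    double dt_adv  = 1.0 / (fabs(vx) / dx + fabs(vy) / dy + 1e-300); /* avoid /0 */
    double dt = 0.9 * fmin(dt_diff, dt_adv);
    assert(dt > 0.0 && isfinite(dt));
    return dt;
}
```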
Appendix: Quick-reference Formulas
- Halo volume per rank ($h=1$, doubles): \(V_{\mathrm{halo}} = 16\,(nx + ny)\ \text{bytes}\)
- Compute work per rank (per step): \(W \sim \Theta(nx\,ny)\)
- Memory (two fields with halos): \(M \approx 16\,(nx + 2)^2\ \text{bytes}\ \text{if}\ nx=ny\ \text{and}\ h=1\)
- Strong scaling efficiency: \(E_s(p) = \frac{T_1}{p\,T_p}\)
- Weak scaling efficiency (baseline at $p0$): \(E_w(p) = \frac{T_{p0}}{T_p}\)