Computational Analysis

This document derives compute, communication, memory, and I/O costs for the main loop in this project and ties them to expected strong/weak scaling behavior and benchmark methodology.

Assumptions

  • Global grid: Nx × Ny, uniform spacings Δx, Δy.
  • MPI 2D Cartesian decomposition: Px × Py ranks, p = Px · Py.
  • Per-rank interior tile: nx × ny where nx ≈ Nx/Px, ny ≈ Ny/Py (last ranks may get remainders).
  • Halo width: h = 1 (5-point diffusion, 1st-order upwind advection).
  • Data type: 8-byte double.
  • One field updated per step (u), with a ping-pong buffer (tmp).

Main Loop Pseudocode

for n in 0..steps-1:
  exchange_halos(u)
  apply_boundary(u)
  compute_diffusion(u, tmp)
  compute_advection(u, tmp)
  swap(u, tmp)
  if (n % out_every == 0): write_snapshot(u,n); report_stats(u,n)

Per-step Complexity (per rank)

1) Halo exchange

Faces exchanged once per step (nonblocking). For $h = 1$:

  • Left/right faces: $2 \times (ny \cdot h)$ doubles
  • Down/up faces: $2 \times (nx \cdot h)$ doubles
\[V_{\mathrm{halo}}(nx, ny) = 8 \left( 2\,ny + 2\,nx \right) \ \text{bytes} = 16 (nx+ny) \ \text{bytes}\]

Latency cost: 4 messages per step (L/R/D/U).
Bandwidth cost: proportional to $nx + ny$ (i.e., $16\,(nx+ny)$ bytes for $h = 1$).
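
A minimal sketch of the four nonblocking exchanges, assuming row-major storage (x fastest) on a padded $(nx+2)\times(ny+2)$ tile, a Cartesian communicator comm2d with x as dimension 0, and a hypothetical indexing macro IDX; none of these names are prescribed by the project. Neighbors missing at physical boundaries come back from MPI_Cart_shift as MPI_PROC_NULL, for which the sends and receives are no-ops.

#include <mpi.h>

/* Hypothetical padded-tile index: row-major, x fastest, 1-cell ghost ring. */
#define IDX(i, j) ((j) * (nx + 2) + (i))

void exchange_halos(double *u, int nx, int ny, MPI_Comm comm2d)
{
    int left, right, down, up;
    MPI_Cart_shift(comm2d, 0, 1, &left, &right);   /* x neighbors */
    MPI_Cart_shift(comm2d, 1, 1, &down, &up);      /* y neighbors */

    /* Strided column type: ny doubles separated by the padded row length. */
    MPI_Datatype col;
    MPI_Type_vector(ny, 1, nx + 2, MPI_DOUBLE, &col);
    MPI_Type_commit(&col);

    MPI_Request req[8];
    /* Left/right faces: ny doubles each way (strided columns). */
    MPI_Irecv(&u[IDX(0, 1)],      1, col, left,  0, comm2d, &req[0]);
    MPI_Irecv(&u[IDX(nx + 1, 1)], 1, col, right, 1, comm2d, &req[1]);
    MPI_Isend(&u[IDX(1, 1)],      1, col, left,  1, comm2d, &req[2]);
    MPI_Isend(&u[IDX(nx, 1)],     1, col, right, 0, comm2d, &req[3]);
    /* Down/up faces: contiguous rows of nx doubles. */
    MPI_Irecv(&u[IDX(1, 0)],      nx, MPI_DOUBLE, down, 2, comm2d, &req[4]);
    MPI_Irecv(&u[IDX(1, ny + 1)], nx, MPI_DOUBLE, up,   3, comm2d, &req[5]);
    MPI_Isend(&u[IDX(1, 1)],      nx, MPI_DOUBLE, down, 3, comm2d, &req[6]);
    MPI_Isend(&u[IDX(1, ny)],     nx, MPI_DOUBLE, up,   2, comm2d, &req[7]);

    MPI_Waitall(8, req, MPI_STATUSES_IGNORE);
    MPI_Type_free(&col);
}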

2) Boundary conditions

Applied only at physical boundaries. Per-rank cost is $O(nx + ny)$ (copy/mirror the ghost rings). Interior ranks, whose neighbors all exist, pay essentially $O(1)$.
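
A minimal sketch of zero-gradient (mirror) boundaries on physical edges only, reusing the IDX macro and neighbor ranks from the halo-exchange sketch above; the project's actual boundary condition may differ.

void apply_boundary(double *u, int nx, int ny,
                    int left, int right, int down, int up)
{
    /* Only edges without a neighbor are physical; mirror the first
     * interior line into the ghost ring (zero-gradient). */
    if (left == MPI_PROC_NULL)
        for (int j = 1; j <= ny; j++) u[IDX(0, j)]      = u[IDX(1, j)];
    if (right == MPI_PROC_NULL)
        for (int j = 1; j <= ny; j++) u[IDX(nx + 1, j)] = u[IDX(nx, j)];
    if (down == MPI_PROC_NULL)
        for (int i = 1; i <= nx; i++) u[IDX(i, 0)]      = u[IDX(i, 1)];
    if (up == MPI_PROC_NULL)
        for (int i = 1; i <= nx; i++) u[IDX(i, ny + 1)] = u[IDX(i, ny)];
}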

3) Diffusion (5-point stencil, explicit)

Each interior point touches a constant number of neighbors (itself plus 4), so the per-step work is linear in the tile area:

\[W_{\mathrm{diff}}(nx, ny) = \Theta(nx \cdot ny)\]

Arithmetic intensity (naive): a few flops per 5 loads + 1 store → memory bound unless tiled.
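
A minimal sketch of the interior update, again reusing IDX; the coefficients assume the operator is applied as $u + \Delta t\,D\,\nabla^2 u$, which is the usual explicit form but not necessarily the project's exact discretization. In this form each point does roughly 10 flops against 5 stencil loads and 1 store, consistent with the memory-bound estimate above.

void compute_diffusion(const double *u, double *tmp, int nx, int ny,
                       double D, double dt, double dx, double dy)
{
    const double cx = D * dt / (dx * dx);
    const double cy = D * dt / (dy * dy);
    for (int j = 1; j <= ny; j++)
        for (int i = 1; i <= nx; i++)
            tmp[IDX(i, j)] = u[IDX(i, j)]
                + cx * (u[IDX(i + 1, j)] - 2.0 * u[IDX(i, j)] + u[IDX(i - 1, j)])
                + cy * (u[IDX(i, j + 1)] - 2.0 * u[IDX(i, j)] + u[IDX(i, j - 1)]);
}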

4) Advection (1st-order upwind)

Constant work per point (two 1D upwind derivatives):

\[W_{\mathrm{adv}}(nx, ny) = \Theta(nx \cdot ny)\]

Similar memory behavior to diffusion (more loads if branching on velocity sign).
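
A matching sketch of the upwind update, adding the advective change onto tmp (which already holds the diffusion result) and assuming a constant velocity field $(v_x, v_y)$; the sign branch below is the "branching on velocity sign" mentioned above.

void compute_advection(const double *u, double *tmp, int nx, int ny,
                       double vx, double vy, double dt, double dx, double dy)
{
    for (int j = 1; j <= ny; j++)
        for (int i = 1; i <= nx; i++) {
            /* Pick the upwind one-sided difference from the sign of v. */
            double dudx = (vx > 0.0)
                ? (u[IDX(i, j)] - u[IDX(i - 1, j)]) / dx
                : (u[IDX(i + 1, j)] - u[IDX(i, j)]) / dx;
            double dudy = (vy > 0.0)
                ? (u[IDX(i, j)] - u[IDX(i, j - 1)]) / dy
                : (u[IDX(i, j + 1)] - u[IDX(i, j)]) / dy;
            tmp[IDX(i, j)] -= dt * (vx * dudx + vy * dudy);
        }
}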

5) Swap

Pointer swap is $O(1)$. If implemented as copy, it would be $O(nx \cdot ny)$ (we avoid that).
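
In the driver loop this is literally a pointer exchange, for example:

double *t = u;   /* ping-pong: next step reads what was just written */
u   = tmp;
tmp = t;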

6) Output & stats (when triggered)

  • write_snapshot: I/O bound; writing the full tile ⇒ $O(nx \cdot ny)$ data per rank (CSV text output is larger on disk than binary).
  • report_stats: local reduction $O(nx \cdot ny)$ plus a global MPI_Allreduce → $O(\log p)$ latency and tiny payload.
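
A minimal sketch of report_stats under these assumptions (min/max/sum statistics and a rank-0 print; the project may track different quantities):

#include <float.h>
#include <stdio.h>

void report_stats(const double *u, int nx, int ny, int rank, MPI_Comm comm)
{
    double lmin = DBL_MAX, lmax = -DBL_MAX, lsum = 0.0;
    for (int j = 1; j <= ny; j++)            /* local O(nx*ny) pass */
        for (int i = 1; i <= nx; i++) {
            double v = u[IDX(i, j)];
            if (v < lmin) lmin = v;
            if (v > lmax) lmax = v;
            lsum += v;
        }
    double gmin, gmax, gsum;                 /* three tiny global reductions */
    MPI_Allreduce(&lmin, &gmin, 1, MPI_DOUBLE, MPI_MIN, comm);
    MPI_Allreduce(&lmax, &gmax, 1, MPI_DOUBLE, MPI_MAX, comm);
    MPI_Allreduce(&lsum, &gsum, 1, MPI_DOUBLE, MPI_SUM, comm);
    if (rank == 0)
        printf("min=%g max=%g sum=%g\n", gmin, gmax, gsum);
}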

Global (per-step) Costs

Sum over all ranks:

  • Compute: $\Theta(Nx \cdot Ny)$ for diffusion + $\Theta(Nx \cdot Ny)$ for advection → $\Theta(Nx \cdot Ny)$ overall.
  • Communication: total halo volume across ranks is ~ perimeter of all tiles. For a balanced grid:
    • Each interior edge is counted twice (once per side); the global halo volume per step is \(V_{\mathrm{halo, global}} \approx 16 \left( Px \cdot Ny + Py \cdot Nx \right)\ \text{bytes}\) (see the derivation after this list).
    • Latency: $4p$ messages per step (but overlappable in practice).
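
Summing the per-rank halo volume over all $p = Px \cdot Py$ ranks makes the stated total explicit:

\[V_{\mathrm{halo,\,global}} \;=\; p \cdot 16\,(nx + ny) \;\approx\; 16\,Px\,Py\left(\frac{Nx}{Px} + \frac{Ny}{Py}\right) \;=\; 16\,(Py \cdot Nx + Px \cdot Ny)\ \text{bytes.}\]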

Strong vs Weak Scaling Expectations

Strong scaling (fixed $Nx \times Ny$, increase $p$)

  • Compute per rank: $\Theta((Nx\cdot Ny)/p)$ → ideal time $\propto 1/p$.
  • Comm per rank: $\Theta(nx + ny)$ with $nx \approx Nx/Px$, $ny \approx Ny/Py$. If $Px \approx Py \approx \sqrt{p}$, then \(nx \approx \frac{Nx}{\sqrt{p}},\quad ny \approx \frac{Ny}{\sqrt{p}} \Rightarrow V_{\mathrm{halo}} \propto \frac{Nx + Ny}{\sqrt{p}}\)
  • Efficiency drops when comm + latency + sync dominate compute. Expect a knee where local tiles become too small.
  • Benchmark scripts also report the Karp-Flatt metric in addition to speedup and efficiency (formula below).
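
The Karp-Flatt metric is the experimentally determined serial fraction, computed from the measured speedup $S_p = T_1/T_p$:

\[e(p) \;=\; \frac{1/S_p - 1/p}{1 - 1/p}\]

A roughly constant $e$ as $p$ grows points to an Amdahl-style serial fraction; a rising $e$ points to growing parallel overhead such as communication.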

Weak scaling (fixed $nx \times ny$ per rank, increase $p$)

  • Compute per rank: ~constant → ideal time flat vs $p$.
  • Comm per rank: $\Theta(nx + ny)$ → also constant (for fixed tile).
    Degradation mainly due to global reductions, network congestion, and I/O.

Memory Footprint (per rank)

For two full fields (u, tmp) including halos:

\[M_{\mathrm{fields}} = 2 \cdot (nx + 2h) (ny + 2h) \cdot 8\ \text{bytes}\]

Plus temporary line buffers, MPI buffers, and derived datatypes (usually negligible compared to the field arrays).
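
For example, with the weak-scaling tile used below ($nx = ny = 256$, $h = 1$):

\[M_{\mathrm{fields}} = 2 \cdot 258 \cdot 258 \cdot 8\ \text{bytes} \approx 1.07\ \text{MB per rank,}\]

and the strong-scaling base case ($1024^2$ on a single rank) needs about $2 \cdot 1026^2 \cdot 8 \approx 16.8$ MB, so the fields fit comfortably in memory.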

Putting It Together — Per-step Time Model

Let:

  • $\alpha$ = latency (s/message), $\beta$ = inverse bandwidth (s/byte) for effective links.
  • $c_d, c_a$ = compute costs per cell (s), for diffusion & advection respectively.

Then a simple overlapping model:

\[T_{\mathrm{step}} \approx \max\left[ (c_d + c_a)\,nx\,ny,\ \ 4\alpha + \beta\,V_{\mathrm{halo}}(nx, ny) \right] \ +\ T_{\mathrm{reduction}}\ \text{(if stats are reported that step)}\]

$T_{\mathrm{reduction}}$ for MPI_Allreduce is roughly $O(\alpha \log p + \beta \cdot \text{payload})$; the payload here is tiny.
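
A small helper to evaluate this model is sketched below; $\alpha$, $\beta$, $c_d$, $c_a$ are to be fitted from measured runs, and the $\log_2 p$ reduction term is a modeling assumption, not a measurement.

#include <math.h>

/* Sketch: evaluate the per-step model; all inputs are fitted/assumed, not measured. */
double model_step_time(int nx, int ny, int p,
                       double alpha, double beta,   /* s/message, s/byte */
                       double c_d, double c_a,      /* s per cell        */
                       int with_stats)
{
    double compute = (c_d + c_a) * (double)nx * (double)ny;
    double comm    = 4.0 * alpha + beta * 16.0 * (nx + ny);  /* V_halo, h = 1  */
    double t       = (compute > comm) ? compute : comm;      /* overlap => max */
    if (with_stats)
        t += alpha * log2((double)p);                         /* ~Allreduce     */
    return t;
}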

Benchmarks to Run

  • Strong scaling: $Nx=Ny=1024$, steps=200, ranks ∈ {1,2,4,8,16}.
    Report: total runtime, per-step avg, $E_s = T_1 / (p\cdot T_p)$.
  • Weak scaling: per-rank $(nx, ny) = (256, 256)$, steps=200, ranks ∈ {1,4,16}.
    Report: per-step avg and $E_w = T_{p0}/T_p$ with $p0=1$.
  • Sensitivity: vary halo $h$ (1→2) to see comm slope; vary output frequency to observe I/O impact.

Measurement Methodology

  • Use MPI_Wtime() to bracket:
    • Entire run (after warm-up barrier) → total runtime.
    • Inside loop → per-step time; accumulate average, min/max.
  • Use MPI_Reduce to collect the worst step time across ranks (max) and the peak memory per rank (VmRSS from /proc/self/status on Linux); see the timing sketch after this list.
  • For shell-level timing comparisons, use /usr/bin/time -v mpirun ... in addition to in-code timers.
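
A minimal sketch of the in-code timers, assuming the loop variables from the pseudocode (steps, comm) and rank 0 as the reporting rank:

MPI_Barrier(comm);                          /* warm-up / start barrier       */
double t_run0 = MPI_Wtime();
double t_step_sum = 0.0, t_step_max = 0.0;

for (int n = 0; n < steps; n++) {
    double t0 = MPI_Wtime();
    /* ... exchange_halos, apply_boundary, diffusion, advection, swap ... */
    double dt_step = MPI_Wtime() - t0;
    t_step_sum += dt_step;
    if (dt_step > t_step_max) t_step_max = dt_step;
}

double t_run = MPI_Wtime() - t_run0;        /* total runtime on this rank    */
double worst_step;                          /* slowest single step, any rank */
MPI_Reduce(&t_step_max, &worst_step, 1, MPI_DOUBLE, MPI_MAX, 0, comm);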

Sanity Checks

  • Determinism: identical outputs across runs with same seeds and decomposition.
  • CFL/diffusion limits: automatically compute a safe $\Delta t$ from the chosen $D$, $v_x$, $v_y$, $\Delta x$, $\Delta y$ and assert it before time stepping (one conservative combination is sketched after this list).
  • Halo correctness: checksum per-rank borders before/after exchange.
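
One conservative way to combine the explicit diffusion limit and the upwind CFL limit is sketched below; the 0.9 safety factor and the sum-of-rates combination are illustrative choices, not the project's mandated rule.

#include <math.h>

double safe_dt(double D, double vx, double vy, double dx, double dy)
{
    const double safety = 0.9;                                      /* margin */
    double diff_rate = 2.0 * D * (1.0 / (dx * dx) + 1.0 / (dy * dy));
    double adv_rate  = fabs(vx) / dx + fabs(vy) / dy;
    return safety / (diff_rate + adv_rate);   /* <= both individual limits */
}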

Appendix: Quick-reference Formulas

  • Halo volume per rank ($h=1$, doubles): \(V_{\mathrm{halo}} = 16\,(nx + ny)\ \text{bytes}\)
  • Compute work per rank (per step): \(W \sim \Theta(nx\,ny)\)
  • Memory (two fields with halos): \(M \approx 16\,(nx + 2)^2\ \text{bytes}\ \text{if}\ nx=ny\ \text{and}\ h=1\)
  • Strong scaling efficiency: \(E_s(p) = \frac{T_1}{p\,T_p}\)
  • Weak scaling efficiency (baseline at $p0$): \(E_w(p) = \frac{T_{p0}}{T_p}\)