# CS 5220
## Distributed memory
### Modeling message costs
## 06 Oct 2015
### Basic questions
- How much does a message cost?
- *Latency*: time for a message to get from one processor to another
- *Bandwidth*: data transferred per unit time
- How does *contention* affect communication?
- This is a combined hardware-software question!
- Goal: understand just enough to model roughly
### Conventional wisdom
- Roughly constant latency (?)
- Wormhole (or cut-through) routing flattens latencies vs store-and-forward at the hardware level
- Software stack dominates HW latency!
- Latencies *not* the same between network levels (within a box vs across boxes)
- May also have store-and-forward at the library level
- Avoid topology-specific optimization
- Want code that runs on next year’s machine, too!
- Bundle topology awareness in vendor MPI libraries?
- Sometimes specify a *software* topology
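As a concrete example of the last point, MPI lets a program declare a *software* topology and leaves the mapping onto hardware to the library. A minimal sketch, assuming a periodic 2D process grid purely for illustration:
```c
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    /* Describe a periodic 2D process grid; reorder = 1 lets the
       library permute ranks to better match the hardware. */
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Dims_create(nprocs, 2, dims);

    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    /* Neighbor ranks in the first grid dimension (e.g. for halo exchange) */
    int left, right;
    MPI_Cart_shift(grid, 0, 1, &left, &right);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```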
### $\alpha$-$\beta$ model
Crudest model: $t_{\mathrm{comm}} = \alpha + \beta M$
- $t_{\mathrm{comm}} = $ communication time
- $\alpha = $ latency
- $\beta = $ inverse bandwidth
- $M = $ message size
Works pretty well for basic guidance!
Typically $\alpha \gg \beta \gg t_{\mathrm{flop}}$. Spending more money on the
network buys a lower $\alpha$.
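To make the model concrete: given timings of messages at several sizes, $\alpha$ and $\beta$ fall out of a least-squares fit of $t = \alpha + \beta M$. A minimal sketch, with made-up placeholder measurements (not Totient data):
```c
#include <stdio.h>

/* Ordinary least-squares fit of t = alpha + beta*M to (size, time) pairs */
void fit_alpha_beta(int n, const double* M, const double* t,
                    double* alpha, double* beta)
{
    double sM = 0, st = 0, sMM = 0, sMt = 0;
    for (int i = 0; i < n; ++i) {
        sM  += M[i];
        st  += t[i];
        sMM += M[i]*M[i];
        sMt += M[i]*t[i];
    }
    *beta  = (n*sMt - sM*st) / (n*sMM - sM*sM);
    *alpha = (st - (*beta)*sM) / n;
}

int main(void)
{
    /* Placeholder (bytes, seconds) measurements -- not real data */
    double M[] = {1e3, 1e4, 1e5, 1e6};
    double t[] = {2e-6, 1.1e-5, 1.0e-4, 1.0e-3};
    double alpha, beta;
    fit_alpha_beta(4, M, t, &alpha, &beta);
    printf("alpha ~ %g s, beta ~ %g s/byte\n", alpha, beta);
    return 0;
}
```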
### LogP model
Like $\alpha$-$\beta$, but includes CPU time on send/recv:
- Latency: the usual
- Overhead: CPU time to send/recv
- Gap: minimum time between consecutive sends or receives
- P: number of processors
Assumes small messages (for a fixed message size, the reciprocal of the gap plays the role of bandwidth).
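A standard back-of-the-envelope consequence of the model (not spelled out above): a single small point-to-point message costs about $o + L + o = L + 2o$ (overhead at the sender, network latency, overhead at the receiver), and a processor can inject at most one message per gap $g$, so its sustained message rate is roughly $1/g$.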
### Communication costs
Some basic goals:
- Prefer a few large messages to many small ones (pay the latency $\alpha$ fewer times)
- Avoid communication when possible
- Great speedup for Monte Carlo and other embarrassingly parallel
codes!
- Overlap communication with computation
- Models tell you how much computation is needed to mask
communication costs.
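A minimal sketch of the overlap idea with nonblocking MPI; the neighbor-exchange framing and the `do_local_work` callback are illustrative assumptions, not part of these notes:
```c
#include <mpi.h>

/* Overlap a neighbor exchange with independent local work.
   do_local_work() stands in for computation that does not
   depend on the incoming data. */
void exchange_and_compute(double* sendbuf, double* recvbuf, int n,
                          int left, int right, void (*do_local_work)(void))
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* If this takes longer than the modeled alpha + beta*M (and the
       library makes progress in the background), the message is hidden. */
    do_local_work();

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    /* Only now is it safe to read recvbuf or reuse sendbuf. */
}
```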
### Intel MPI on Totient
- Two 6-core chips per node, eight nodes
- Heterogeneous network:
- Ring between cores
- Bus between chips
- Gigabit ethernet between nodes
- Test ping-pong
- Between cores on same chip
- Between chips on same node
- Between nodes
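A minimal ping-pong timer in this spirit; the message size, repetition count, and use of ranks 0 and 1 are illustrative choices, not the actual benchmark from class:
```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Run with at least two ranks; half the average round-trip time
   for an M-byte message estimates alpha + beta*M. */
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int M = (argc > 1) ? atoi(argv[1]) : 1024;  /* message size (bytes) */
    int nreps = 1000;
    char* buf = malloc(M);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < nreps; ++i) {
        if (rank == 0) {
            MPI_Send(buf, M, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, M, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, M, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, M, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("M = %d bytes: one-way time ~ %g s\n", M, (t1-t0)/(2.0*nreps));

    free(buf);
    MPI_Finalize();
    return 0;
}
```
Pinning the two ranks to the same chip, to different chips on a node, or to different nodes gives the three cases above.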
### Approximate $\alpha$-$\beta$ parameters
(On-node and cross-node fits from the ping-pong tests are not reproduced here.)
### Network model
- On-chip: $\alpha$-$\beta$ works well!
- Off-chip: Not so much
- But cross-node communication is clearly expensive!
### Moral
Not all links are created equal!
- Might handle with mixed paradigm
- OpenMP on node, MPI across
- Have to worry about thread-safety of MPI calls
- Can handle purely within MPI
- Can ignore the issue completely?
For today, we’ll take the last approach.
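For reference, if we did take the mixed OpenMP-plus-MPI route, the thread-safety worry is handled by requesting a support level at startup. A minimal sketch, assuming the FUNNELED level (only the main thread makes MPI calls, e.g. outside OpenMP parallel regions):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    /* Ask MPI for FUNNELED support: threads exist, but only the
       main thread makes MPI calls. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI thread support level too low\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... OpenMP parallel regions on node, MPI messages across ... */

    MPI_Finalize();
    return 0;
}
```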