# CS 5220
## Distributed memory
### Networks and models
## 06 Oct 2015
### Basic questions
- How much does a message cost? (a simple cost model is sketched after this list)
    - *Latency*: time to get between processors
    - *Bandwidth*: data transferred per unit time
- How does *contention* affect communication?
- This is a combined hardware-software question!
- We want to understand just enough for reasonable modeling.
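A common first-cut cost model (not stated on the slide; $\alpha$ and $\beta$ are assumed notation): sending an $n$-byte message takes roughly

$$T(n) \approx \alpha + \beta n,$$

where $\alpha$ is the latency and $\beta$ is the inverse bandwidth (time per byte). With illustrative numbers such as $\alpha \approx 1\,\mu\text{s}$ and $\beta \approx 1\,\text{ns/byte}$ (not measured values), a 1 MB message costs about 1 ms: latency dominates short messages, bandwidth dominates long ones.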
### Thinking about interconnects
Several features characterize an interconnect:
- *Topology*: who do the wires connect?
- *Routing*: how do we get from A to B?
- *Switching*: circuits, store-and-forward?
- *Flow control*: how do we manage limited resources?
### Thinking about interconnects
- Links are like streets
- Switches are like intersections
- Hops are like blocks traveled
- Routing algorithm is like a travel plan
- Stop lights are like flow control
- Short packets are like cars, long ones like buses?
At some point the analogy breaks down...
### Bus topology
- One set of wires (the bus)
- Only one processor may use the bus at a time
- Contention for the bus is an issue
- Example: basic Ethernet, some SMPs
### Crossbar
- Dedicated path from every input to every output
- Takes $O(p^2)$ switches and wires!
### Bus vs. crossbar
- Crossbar: more hardware
- Bus: more contention (less capacity?)
- Generally seek happy medium
    - Less contention than bus
    - Less hardware than crossbar
    - May give up one-hop routing
### Network properties
Think about latency and bandwidth via two quantities:
- *Diameter*: max distance between nodes (see the sketch after this list)
- *Bisection bandwidth*: smallest bandwidth over all cuts that bisect the network
    - Particularly important for all-to-all communication
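A minimal sketch of the diameter definition in code (not from the slides; the 6-node ring adjacency matrix is just an assumed example input): run BFS from every node and take the largest hop count. Bisection bandwidth is much harder to compute for a general graph, so it is not attempted here.

```c
#include <stdio.h>

#define P 6  /* example: a 6-node ring */

/* Adjacency matrix for the 6-node ring (illustrative input). */
static const int adj[P][P] = {
    {0,1,0,0,0,1},
    {1,0,1,0,0,0},
    {0,1,0,1,0,0},
    {0,0,1,0,1,0},
    {0,0,0,1,0,1},
    {1,0,0,0,1,0},
};

/* BFS hop counts from src to every node. */
static void bfs(int src, int dist[P])
{
    int queue[P], head = 0, tail = 0;
    for (int i = 0; i < P; i++) dist[i] = -1;
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < P; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
    }
}

int main(void)
{
    int dist[P], diameter = 0;
    for (int s = 0; s < P; s++) {        /* max over all source nodes */
        bfs(s, dist);
        for (int v = 0; v < P; v++)
            if (dist[v] > diameter) diameter = dist[v];
    }
    printf("diameter = %d\n", diameter); /* 3 for this ring: matches p/2 */
    return 0;
}
```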
### Linear topology
- $p-1$ links
- Diameter $p-1$
- Bisection bandwidth $1$
### Ring topology
- $p$ links
- Diameter $p/2$
- Bisection bandwidth $2$
### Mesh
- May be more than two dimensions
- Route along each dimension in turn
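A small sketch of "route along each dimension in turn" (dimension-ordered routing) on an assumed $4 \times 4$ mesh; the coordinates and grid size are illustrative, not from the slide.

```c
#include <stdio.h>

#define NX 4  /* illustrative 4 x 4 mesh */
#define NY 4

/* Dimension-ordered routing: correct the x coordinate, then y.
 * The hop count on a mesh is the Manhattan distance |dx| + |dy|. */
static int route_xy(int sx, int sy, int dx, int dy)
{
    int hops = 0;
    while (sx != dx) { sx += (dx > sx) ? 1 : -1; hops++; }  /* dimension 0 */
    while (sy != dy) { sy += (dy > sy) ? 1 : -1; hops++; }  /* dimension 1 */
    return hops;
}

int main(void)
{
    /* Corner to corner is the worst case: diameter (NX-1)+(NY-1) = 6 here. */
    printf("hops from (0,0) to (%d,%d): %d\n", NX-1, NY-1,
           route_xy(0, 0, NX-1, NY-1));
    return 0;
}
```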
### Torus
- Torus : Mesh :: Ring : Linear
### Hypercube
- Label processors with binary numbers
- Connect $p_1$ to $p_2$ if labels differ in one bit
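A quick sketch of the labeling rule: the hop count between two hypercube nodes is the number of bits in which their labels differ, i.e. the popcount of the XOR, so the diameter is $\log_2 p$. `__builtin_popcount` is a GCC/Clang builtin, assumed here for brevity.

```c
#include <stdio.h>

/* Hop distance on a hypercube: count the bit positions where the labels
 * differ.  __builtin_popcount is a GCC/Clang builtin (assumed here). */
static int hypercube_hops(unsigned a, unsigned b)
{
    return __builtin_popcount(a ^ b);
}

int main(void)
{
    /* 3-bit labels give an 8-node hypercube: p = 8, diameter log2(p) = 3. */
    printf("hops(011 -> 001) = %d (neighbors)\n", hypercube_hops(3, 1));
    printf("hops(000 -> 111) = %d (diameter)\n",  hypercube_hops(0, 7));
    return 0;
}
```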
### Fat tree
- Processors at leaves
- Increase link bandwidth near root
### Others...
- Butterfly network
- Omega network
- Cayley graph
### Conventional wisdom
- Roughly constant latency (?)
    - Wormhole routing (or cut-through) flattens latencies vs store-and-forward at the hardware level
    - Software stack dominates HW latency!
    - Latencies *not* the same between networks (in box vs across)
    - May also have store-and-forward at the library level
- Avoid topology-specific optimization
    - Want code that runs on next year's machine, too!
    - Bundle topology awareness in vendor MPI libraries?
    - Sometimes specify a *software* topology (see the MPI sketch below)
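A minimal MPI sketch of specifying a *software* topology (the periodic 2D grid is an illustrative choice, not the course's setup): `MPI_Cart_create` builds a Cartesian communicator that a topology-aware MPI implementation may map onto the physical network, and `MPI_Cart_shift` reports neighbor ranks.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int p, dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Dims_create(p, 2, dims);                  /* factor p into a 2D grid */

    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* allow rank reordering */, &grid);

    int rank, coords[2], left, right;
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    MPI_Cart_shift(grid, 0, 1, &left, &right);    /* neighbors in dimension 0 */
    printf("rank %d at (%d,%d): dim-0 neighbors %d, %d\n",
           rank, coords[0], coords[1], left, right);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```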