# CS 5220: Applications of Parallel Computers
## Instruction-level parallelism
## 01 Sep 2015
## Example 1: Laundry
- Three stages to laundry: wash, dry, fold
- Three loads: darks, lights, underwear
- How long will this take?
How long will it take?
Three loads of laundry to wash, dry, and fold,
at one hour per stage. What is the total time?
A: 9 hours
=: You spend too much time on laundry
A: 5 hours
=: That's what I had in mind!
A: 3 hours
=: Maybe at a laundromat; what if only one washer/drier?
## Setup
- Three *functional units*
    - Washer
    - Drier
    - Folding table
- Different cases
    - One load at a time
    - Three loads with one washer/drier
    - Three loads with friends at the laundromat
Serial execution (9 hours)

| Hour   | 1    | 2   | 3    | 4    | 5   | 6    | 7    | 8   | 9    |
|--------|------|-----|------|------|-----|------|------|-----|------|
| Load 1 | Wash | Dry | Fold |      |     |      |      |     |      |
| Load 2 |      |     |      | Wash | Dry | Fold |      |     |      |
| Load 3 |      |     |      |      |     |      | Wash | Dry | Fold |
Pipelined execution (5 hours)

| Hour   | 1    | 2    | 3    | 4    | 5    |
|--------|------|------|------|------|------|
| Load 1 | Wash | Dry  | Fold |      |      |
| Load 2 |      | Wash | Dry  | Fold |      |
| Load 3 |      |      | Wash | Dry  | Fold |
Parallel units (3 hours)

| Hour   | 1    | 2   | 3    |
|--------|------|-----|------|
| Load 1 | Wash | Dry | Fold |
| Load 2 | Wash | Dry | Fold |
| Load 3 | Wash | Dry | Fold |
## Example 2: Arithmetic
$$2 \times 2 + 3 \times 3$$
> A child of five would understand this. Send someone to fetch a child
> of five.
> -- [Groucho Marx](http://www.goodreads.com/quotes/98966-a-child-of-five-could-understand-this-send-someone-to)
How long will it take?
Suppose all children can do one add or multiply per second.
How long would it take to compute $2 \times 2 + 3 \times 3$?
A: 3 seconds
=: OK, three ops at one op/s; what if there are multiple kids?
A: 2 seconds
=: OK, if two kids do the multiplies in parallel
A: 1 second
=: Not without finding faster kids!
One child

| Second  | 1                | 2                | 3            |
|---------|------------------|------------------|--------------|
| Child 1 | $2 \times 2 = 4$ | $3 \times 3 = 9$ | $4 + 9 = 13$ |

Total time is 3 seconds.
Two children

| Second  | 1                | 2            |
|---------|------------------|--------------|
| Child 1 | $2 \times 2 = 4$ | $4 + 9 = 13$ |
| Child 2 | $3 \times 3 = 9$ |              |

Total time is 2 seconds.
Many children

| Second  | 1                | 2            |
|---------|------------------|--------------|
| Child 1 | $2 \times 2 = 4$ | $4 + 9 = 13$ |
| Child 2 | $3 \times 3 = 9$ |              |

Total time remains 2 seconds: the sum of the latencies of the
two stages with a data dependency between them.
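The same dependency structure appears in code. A minimal C sketch (illustrative only, not part of the original slides): the two multiplies do not depend on each other, so a superscalar core can issue them together, while the add must wait for both results.

```c
#include <stdio.h>

int main(void) {
    int a = 2 * 2;      /* independent of b: can execute in parallel */
    int b = 3 * 3;      /* independent of a: can execute in parallel */
    int c = a + b;      /* depends on both a and b: must wait        */
    printf("%d\n", c);  /* prints 13 */
    return 0;
}
```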
## Pipelining
- Improves *bandwidth* (throughput), but not *latency*
    - Keeping the pipeline full takes independent work (see the sketch after this list)
- Potential speedup = number of stages
- What if there's a branch?
- Different pipelines for different functional units
    - Front-end has a pipeline
    - Functional units (FP adder, multiplier) pipelined
    - Divider often not pipelined
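As a concrete illustration of "bandwidth, not latency", here is a minimal C sketch (the function and variable names are made up for this example): a single running sum forms a chain of dependent adds, so each add waits out the full adder latency, while independent partial sums let a pipelined adder start a new add every cycle.

```c
/* Latency-bound: each add depends on the previous one. */
double sum_chain(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];                  /* serial dependency chain */
    return s;
}

/* Throughput-friendly: two independent accumulators keep the
   pipelined adder busy; combine them at the end. */
double sum_split(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i];                 /* these two adds do not depend */
        s1 += x[i + 1];             /*   on each other              */
    }
    for (; i < n; i++) s0 += x[i];  /* remainder */
    return s0 + s1;
}
```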
## SIMD
- Single Instruction Multiple Data
- Old idea with a resurgence in the 90s (for graphics)
- Now short vectors are ubiquitous
    - 256-bit wide AVX on CPU
    - 512-bit wide on Xeon Phi!
- Alignment matters (see the sketch below)
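A minimal sketch of what a 256-bit SIMD add looks like with AVX intrinsics (assumptions: `n` is a multiple of 4, the arrays are 32-byte aligned so the aligned load/store forms are legal, and the function name is made up):

```c
#include <immintrin.h>

/* z[i] = x[i] + y[i], four doubles per iteration.
   Example build: gcc -O2 -mavx vec_add.c -c */
void vec_add(double *restrict z, const double *restrict x,
             const double *restrict y, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d xv = _mm256_load_pd(x + i);   /* aligned 256-bit load  */
        __m256d yv = _mm256_load_pd(y + i);
        __m256d zv = _mm256_add_pd(xv, yv);   /* 4 double adds at once */
        _mm256_store_pd(z + i, zv);           /* aligned 256-bit store */
    }
}
```

With unaligned data you would use `_mm256_loadu_pd`/`_mm256_storeu_pd` instead, usually at some cost.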
## Example: [My laptop](http://www.everymac.com/systems/apple/macbook_pro/specs/macbook-pro-core-i5-2.6-13-late-2013-retina-display-specs.html)
MacBook Pro (Retina, 13-inch, Late 2013)
- [Intel Core i5-4288U (Haswell arch)](http://ark.intel.com/products/75991/Intel-Core-i5-4288U-Processor-3M-Cache-up-to-3_10-GHz)
- Two cores / four HW threads
- Variable clock: 2.6 GHz / 3.1 GHz TurboBoost
- Four-wide front end (fetch + decode 4 ops/cycle/core)
    - Operations internally broken down into "micro-ops"
    - Caches micro-ops -- like a hardware JIT?!
## My laptop: floating point
- [256-bit SIMD (AVX)](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
- [Two fully pipelined FP units](http://www.realworldtech.com/haswell-cpu/4/)
    - Two multiplies or fused multiply-adds (FMA) per cycle
      (see the sketch after this list)
    - FMA = Fused Multiply-Add: one op, one rounding error
    - Only one regular add per cycle
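A minimal sketch of a vector FMA using the Haswell FMA3 intrinsic (same assumptions as before: aligned data, `n` a multiple of 4, made-up function name); each `_mm256_fmadd_pd` does four multiply-adds with a single rounding:

```c
#include <immintrin.h>

/* c[i] += a[i] * b[i], four doubles per iteration.
   Example build: gcc -O2 -mfma fma_acc.c -c */
void fma_accumulate(double *restrict c, const double *restrict a,
                    const double *restrict b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d av = _mm256_load_pd(a + i);
        __m256d bv = _mm256_load_pd(b + i);
        __m256d cv = _mm256_load_pd(c + i);
        cv = _mm256_fmadd_pd(av, bv, cv);   /* c = a*b + c, one rounding */
        _mm256_store_pd(c + i, cv);
    }
}
```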
## Peak flop rate
- Result (double precision) $\approx 100$ GFlop/s (see the check below)
    - 2 flops/FMA
    - $\times 4$ FMA/vector FMA = 8 flops/vector FMA
    - $\times 2$ vector FMAs/cycle = 16 flops/cycle/core
    - $\times 2$ cores = 32 flops/cycle
    - $\times 3.1 \times 10^9$ cycles/s $\approx 100$ GFlop/s
- Single precision $\approx 200$ GFlop/s
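The derivation above is just arithmetic; a tiny C check (variable names are only for this sketch):

```c
#include <stdio.h>

int main(void) {
    double flops_per_fma       = 2;      /* multiply + add        */
    double fmas_per_vector     = 4;      /* 4 doubles in 256 bits */
    double vector_fmas_per_cyc = 2;      /* two FMA units         */
    double cores               = 2;
    double cycles_per_second   = 3.1e9;  /* TurboBoost clock      */
    double peak = flops_per_fma * fmas_per_vector * vector_fmas_per_cyc
                * cores * cycles_per_second;
    printf("Peak: %.1f GFlop/s\n", peak / 1e9);  /* about 99.2 */
    return 0;
}
```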
## Reaching peak flop
- Need lots of *independent* vector work
- FMA latency = 5 cycles on Haswell
- Need $8 \times 5 = 40$ *independent* FMAs to reach peak
  (see the sketch after this list)
- Great for matrix multiply -- hard in general
- Still haven't [talked about memory!](/slides/2015-09-01-memory.html)
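One common way to expose that much independent work is to keep several partial sums. A minimal C sketch of a dot product (illustrative names; 8 accumulators are shown for brevity, while hitting the full 40 independent FMAs would need more unrolling plus vectorization):

```c
double dot(const double *x, const double *y, int n)
{
    double s[8] = {0.0};                   /* independent partial sums  */
    int i;
    for (i = 0; i + 8 <= n; i += 8)
        for (int j = 0; j < 8; j++)
            s[j] += x[i + j] * y[i + j];   /* no dependence across j    */
    double total = 0.0;
    for (; i < n; i++)
        total += x[i] * y[i];              /* remainder                 */
    for (int j = 0; j < 8; j++)
        total += s[j];                     /* combine partial sums once */
    return total;
}
```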
## Punchline
- Special features: SIMD, FMA
- Compiler understands how to use these *in principle*
    - Rearranges instructions to get a good mix
    - Tries to use FMAs, vector instructions
- *In practice*, the compiler needs your help
    - Set optimization flags, pragmas, etc. (see the sketch below)
    - Rearrange code to make it obvious and predictable
    - Use special intrinsics or library routines
    - Choose data layouts + algorithms to suit the machine
- Goal: you handle the high level, the compiler handles the low level
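For example, a minimal sketch of "helping the compiler" (illustrative only): `restrict` promises no aliasing and the simple loop shape auto-vectorizes easily, while flags like `gcc -O3 -march=native` turn on the vector/FMA hardware. Details vary by compiler.

```c
/* A loop the compiler can map onto vector FMAs, assuming the restrict
   promise (x and y do not overlap) holds.
   Example build: gcc -O3 -march=native -c axpy.c */
void axpy(int n, double a, const double *restrict x, double *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* one FMA per element */
}
```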