# CS 5220: Applications of Parallel Computers

## Instruction-level parallelism

01 Sep 2015
## Example 1: Laundry

- Three stages to laundry: wash, dry, fold
- Three loads: darks, lights, underwear
- How long will this take?

How long will it take?

Three loads of laundry to wash, dry, fold, one hour per stage. What is the total time?

- 9 hours: you spend too much time on laundry
- 5 hours: that's what I had in mind!
- 3 hours: maybe at a laundromat; what if there is only one washer/drier?
## Setup

- Three *functional units*
  - Washer
  - Drier
  - Folding table
- Different cases
  - One load at a time
  - Three loads with one washer/drier
  - Three loads with friends at the laundromat

Serial execution (9 hours):

| Hour   | 1    | 2   | 3    | 4    | 5   | 6    | 7    | 8   | 9    |
|--------|------|-----|------|------|-----|------|------|-----|------|
| Load 1 | Wash | Dry | Fold |      |     |      |      |     |      |
| Load 2 |      |     |      | Wash | Dry | Fold |      |     |      |
| Load 3 |      |     |      |      |     |      | Wash | Dry | Fold |

Pipelined execution (5 hours):

| Hour   | 1    | 2    | 3    | 4    | 5    |
|--------|------|------|------|------|------|
| Load 1 | Wash | Dry  | Fold |      |      |
| Load 2 |      | Wash | Dry  | Fold |      |
| Load 3 |      |      | Wash | Dry  | Fold |

Parallel units (3 hours):

| Hour   | 1    | 2   | 3    |
|--------|------|-----|------|
| Load 1 | Wash | Dry | Fold |
| Load 2 | Wash | Dry | Fold |
| Load 3 | Wash | Dry | Fold |
## Example 2: Arithmetic

$$2 \times 2 + 3 \times 3$$

> A child of five would understand this. Send someone to fetch a child
> of five.
>
> -- [Groucho Marx](http://www.goodreads.com/quotes/98966-a-child-of-five-could-understand-this-send-someone-to)

How long will it take?

Suppose all children can do one add or multiply per second. How long would it take to compute $2 \times 2 + 3 \times 3$?

- 3 seconds: OK, three ops at one op/s; what if there are multiple kids?
- 2 seconds: OK, if two kids do the multiplies in parallel
- 1 second: not without finding faster kids!

One child (3 seconds):

| Second | Child 1          |
|--------|------------------|
| 1      | $2 \times 2 = 4$ |
| 2      | $3 \times 3 = 9$ |
| 3      | $4 + 9 = 13$     |

Two children (2 seconds):

| Second | Child 1          | Child 2          |
|--------|------------------|------------------|
| 1      | $2 \times 2 = 4$ | $3 \times 3 = 9$ |
| 2      | $4 + 9 = 13$     |                  |

Many children (still 2 seconds):

| Second | Child 1          | Child 2          | Child 3, ... |
|--------|------------------|------------------|--------------|
| 1      | $2 \times 2 = 4$ | $3 \times 3 = 9$ | (idle)       |
| 2      | $4 + 9 = 13$     |                  |              |

Total time remains 2 seconds: the sum of the latencies of the two stages with a data dependency between them.

## Pipelining

- Improves *bandwidth*, but not *latency*
- Potential speedup = number of stages
  - What if there's a branch?
- Different pipelines for different functional units
  - Front-end has a pipeline
  - Functional units (FP adder, multiplier) are pipelined
  - Divider is often not pipelined
## SIMD

- Single Instruction Multiple Data
- Old idea with a resurgence in the 90s (for graphics)
- Now short vectors are ubiquitous
  - 256-bit wide AVX on the CPU
  - 512-bit wide on Xeon Phi!
- Alignment matters
## Example: [My laptop](http://www.everymac.com/systems/apple/macbook_pro/specs/macbook-pro-core-i5-2.6-13-late-2013-retina-display-specs.html)

MacBook Pro (Retina, 13 in, Late 2013)

- [Intel Core i5-4288U (Haswell arch)](http://ark.intel.com/products/75991/Intel-Core-i5-4288U-Processor-3M-Cache-up-to-3_10-GHz)
- Two cores / four HW threads
- Variable clock: 2.6 GHz base / 3.1 GHz TurboBoost
- Four-wide front end (fetch+decode 4 ops/cycle/core)
- Operations internally broken down into "micro-ops"
  - Caches micro-ops -- like a hardware JIT?!
## My laptop: floating point

- [256-bit SIMD (AVX)](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
- [Two fully pipelined FP units](http://www.realworldtech.com/haswell-cpu/4/)
  - Two multiplies or Fused Multiply-Adds (FMAs) per cycle
  - FMA = Fused Multiply-Add: one op, one rounding error
  - Only one regular add per cycle
## Peak flop rate

Result (double precision) $\approx 100$ GFlop/s:

- 2 flops/FMA
- $\times 4$ FMAs/vector FMA = 8 flops/vector FMA
- $\times 2$ vector FMAs/cycle = 16 flops/cycle
- $\times 2$ cores = 32 flops/cycle
- $\times 3.1 \times 10^9$ cycles/s $\approx 100$ GFlop/s

Single precision: $\approx 200$ GFlop/s
## Reaching peak flop

- Need lots of *independent* vector work
- FMA latency = 5 cycles on Haswell
- Need $8 \times 5 = 40$ *independent* FMAs in flight to reach peak (8 scalar FMAs issued per cycle $\times$ 5-cycle latency)
- Great for matrix multiply -- hard in general
- Still haven't [talked about memory!](/slides/2015-09-01-memory.html)
## Punchline

- Special features: SIMD, FMA
- The compiler understands how to use these *in principle*
  - Rearranges instructions to get a good mix
  - Tries to use FMAs, vector instructions
- *In practice*, the compiler needs your help
  - Set optimization flags, pragmas, etc.
  - Rearrange code to make it obvious and predictable
  - Use special intrinsics or library routines
  - Choose data layouts and algorithms to suit the machine
- Goal: you handle the high-level, the compiler handles the low-level