CS 5220: Applications of Parallel Computers
Instruction-level parallelism
01 Sep 2015
Example 1: Laundry
- Three stages to laundry: wash, dry, fold
- Three loads: darks, lights, underwear
- How long will this take?
How long will it take?
Three loads of laundry to wash, dry, fold.
One hour per stage. What is the total time?
Setup
- Three functional units
- Washer
- Drier
- Folding table
- Different cases
- One load at a time
- Three loads with one washer/drier
- Three loads with friends at the laundromat
Serial execution (9 hours)
Hour:    1    2    3    4    5    6    7    8    9
Load 1:  Wash Dry  Fold
Load 2:                 Wash Dry  Fold
Load 3:                                Wash Dry  Fold
Pipelined execution (5 hours)
Hour:    1    2    3    4    5
Load 1:  Wash Dry  Fold
Load 2:       Wash Dry  Fold
Load 3:            Wash Dry  Fold
Parallel units (3 hours)
Hour:    1    2    3
Load 1:  Wash Dry  Fold
Load 2:  Wash Dry  Fold
Load 3:  Wash Dry  Fold
Example 2: Arithmetic
2×2+3×3
A child of five would understand this. Send someone to fetch a child
of five.
-- Groucho Marx
How long will it take?
Suppose all children can do one add or multiply per second.
How long would it take to compute 2×2+3×3?
One child
Total time is 3 seconds
Two children
Total time is 2 seconds
Many children
Total time remains 2 seconds = sum of latencies for
two stages with a data dependency between them.
Pipelining
- Improves bandwidth, but not latency
- Potential speedup = number of stages
- What if there's a branch?
- Different pipelines for different functional units
- Front-end has a pipeline
- Functional units (FP adder, multiplier) pipelined
- Divider often not pipelined
SIMD
- Single Instruction Multiple Data
- Old idea with resurgence in 90s (for graphics)
- Now short vectors are ubiquitous
- 256 bit wide AVX on CPU
- 512 bit wide on Xeon Phi!
- Alignment matters
MacBook Pro (Retina, 13 in, Late 2013)
- Intel Core i5-4228U (Haswell arch)
- Two cores / four HW threads
- Variable clock: 2.6 GHz / 3.1 GHz TurboBoost
- Four-wide front end (fetch + decode 4 ops/cycle/core)
- Operations internally broken down into "micro-ops"
- Cache micro-ops -- like hardware JIT?!
My laptop: floating point
Peak flop rate
- Result (double precision) ≈100 GFlop/s
- 2 flops/FMA
- ×4 FMAs/vector FMA (four doubles per 256-bit register) = 8 flops/vector FMA
- ×2 vector FMA units/cycle = 16 flops/cycle/core
- ×2 cores = 32 flops/cycle
- ×3.1×10^9 cycles/s ≈ 100 GFlop/s
- Single precision ≈200 GFlop/s
Reaching peak flop
- Need lots of independent vector work
- FMA latency = 5 cycles on Haswell
- Need 8×5=40 independent FMA to reach peak
- Great for matrix multiply -- hard in general
- Still haven't talked about memory!
Punchline
- Special features: SIMD, FMA
- Compiler understands how to use these in principle
- Rearranges instructions to get good mix
- Tries to use FMAs, vector instructions
- In practice, the compiler needs your help
- Set optimization flags, pragmas, etc.
- Rearrange code to make obvious and predictable
- Use special intrinsics or library routines
- Choose data layouts + algorithms to suit machine
- Goal: You handle high-level, compiler handles low-level