# CS 5220: Applications of Parallel Computers
## Intro to Performance Analysis
## 27 Aug 2015
## Reading Note
This is a supplement to [the notes](/2015/08/10/performance.html).
Go read them, too!
## How Fast Can We Go?
- Speed in flop/s for Linpack: [top500](http://www.top500.org)
- Giga ($10^9$) -- a single core
- Tera ($10^{12}$) -- a big machine
- Peta ($10^{15}$) -- current top 10 machines (5 in US)
- Exa ($10^{18}$) -- favorite of funding agencies
- Current record-holder: China's Tianhe-2
- 33.9 Petaflop/s on Linpack (54.9 Petaflop/s theoretical peak)
- 17.8 MW + cooling
## Tianhe-2 environment
Commodity nodes, custom interconnect:
- Xeon E5-2692 nodes with Phi accelerators
- Intel compilers + Intel math kernel libraries
- MPICH2 MPI with customized channel
- Kylin Linux
- *TH Express-2 interconnect*
## A US Contender
[Sequoia at LLNL (#3 on the Top 500)](http://www.top500.org/system/177556)
- 20.1 Petaflop/s theoretical peak
- 17.2 Petaflop/s Linpack benchmark (86% of peak)
- 14.4 Petaflop/s in a bubble-cloud sim (72% of peak)
- 2013 Gordon Bell Prize
- 2010 prize: 30% of peak on ORNL's Jaguar
- Performance on more standard code?
- 10% of peak is probably very good!
## Parallel Performance in Practice
- Peak > Linpack > Gordon Bell > Typical
- Measuring performance of real applications is hard
- Typically a few bottlenecks slow things down
- Figuring out why can be tricky!
- And we *really* care about time-to-solution (see the timing sketch after this list)
- Sophisticated methods get answers in fewer flops
- ... but may look bad in flop rate benchmarks
- Lots of delusion and deception in performance analysis
- Read [the notes](/2015/08/10/performance.html)!
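Since time-to-solution means wall-clock time (not CPU time or flop rate), here is a minimal timing sketch using the POSIX `clock_gettime` call; the region being timed is a placeholder, and on older glibc you may need to link with `-lrt`:

```c
#include <stdio.h>
#include <time.h>

/* Return wall-clock time in seconds (POSIX monotonic clock). */
double wall_time(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    double t_start = wall_time();
    /* ... code being benchmarked goes here (placeholder) ... */
    double t_elapsed = wall_time() - t_start;
    printf("Elapsed wall-clock time: %g s\n", t_elapsed);
    return 0;
}
```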
## Quantifying Parallel Performance
- Starting point: good *serial* performance
- Strong scaling: compare parallel to serial time (fixed problem size)
- Speedup = Serial Time / Parallel Time
- Efficiency = Speedup / $p$
- Ideally, speedup = $p$; usually lower (see the sketch after this list)
- Barriers to perfect speedup
- Serial work (Amdahl's law)
- Parallel overheads (communication, synchronization)
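A sketch of the bookkeeping (the times and processor count below are invented for illustration):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical measurements: tuned serial time, and time on p procs. */
    double t_serial   = 12.0;   /* seconds, serial run */
    double t_parallel =  2.0;   /* seconds, on p processors */
    int    p          =  8;

    double speedup    = t_serial / t_parallel;   /* ideally p */
    double efficiency = speedup / p;             /* ideally 1 */
    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}
```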
## Amdahl's Law
\begin{align}
p = & \mbox{ number of processors} \\
s = & \mbox{ fraction of work that is serial} \\
t_s = & \mbox{ serial time} \\
t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s/p
\end{align}
$$
\mbox{Speedup} = \frac{t_s}{t_p} = \frac{1}{s + (1-s)/p} < \frac{1}{s}
$$
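A minimal sketch that tabulates the Amdahl bound (the serial fraction $s = 0.1$ is an arbitrary example value, not from the notes):

```c
#include <stdio.h>

/* Amdahl bound on speedup for serial fraction s on p processors. */
double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    double s = 0.1;   /* assumed serial fraction */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p = %4d: speedup <= %6.2f\n", p, amdahl_speedup(s, p));
    printf("asymptote: 1/s = %g\n", 1.0 / s);
    return 0;
}
```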
Things look better if $n$ grows with $p$ (a weak scaling study)
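As a worked instance (the serial fraction $s = 0.05$ is our example value): under strong scaling, Amdahl caps the speedup at $1/s = 20$ no matter how many processors we add,
$$
\mbox{Speedup} = \frac{1}{0.05 + 0.95/p} < 20 \quad \mbox{for all } p,
$$
whereas growing $n$ with $p$ typically shrinks the effective serial fraction, so the ceiling recedes.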
## Summary
- We're approaching exaflop *peak* rates
- Codes rarely get peak performance
- Better: Compare to tuned serial performance
- Measure *speedup* and *efficiency*
- Strong scaling: increase $p$, fix $n$
- Weak scaling: increase both $p$ and $n$
- Serial bottlenecks and communication overheads kill speedup
- Simple analytical models help understand scaling