# CS 5220: Applications of Parallel Computers
## Memory matters
## 01 Sep 2015
## [My laptop](http://ark.intel.com/products/75991/Intel-Core-i5-4288U-Processor-3M-Cache-up-to-3_10-GHz)
- Theoretical peak: roughly 100 GFlop/s (2 cores $\times$ 3.1 GHz $\times$ 16 double-precision flops/cycle via AVX2 FMAs)
- Peak memory bandwidth: 25.6 GB/s
- Arithmetic intensity = flops/memory access
- Low arithmetic intensity is bad news... (see the sketch below)
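For concreteness, here is a classic low-intensity kernel (a sketch; `daxpy` is the standard name for this operation):

```c
/* daxpy: y <- a*x + y.  Each iteration does 2 flops (a multiply and an
   add) but touches 3 doubles (read x[i], read y[i], write y[i]), so the
   arithmetic intensity is only 2/3 flop per memory access. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```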
## How long will it take?
Consider my machine (peak 100 GFlop/s, peak memory bandwidth 25.6 GB/s).
Suppose I have a code with a double-precision arithmetic intensity of one (one flop per memory access).
What is the max flop rate?
A: 100 GFlop/s
=: I appreciate your optimism, but no.
A: 25.6 GFlop/s
=: Right idea, but each double is eight bytes.
A: 3.2 GFlop/s
=: Yup.
A: 320 MFlop/s
=: That's a bit too pessimistic.
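To spell out the arithmetic: at intensity one, every flop needs one double (8 bytes) from memory, so

$$\text{flop rate} \le \frac{25.6\ \text{GB/s}}{8\ \text{B/flop}} = 3.2\ \text{GFlop/s}.$$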
## Memory basics
- Memory *latency* = how long it takes to get a requested item
- Memory *bandwidth* = rate at which memory provides data (see the model below)
- Bandwidth improving faster than latency
- Processor demand growing faster than either!
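A standard first-order model combining the two (a sketch that ignores overlap and pipelining): the time to move $n$ bytes is roughly

$$T(n) \approx \alpha + \frac{n}{\beta},$$

where $\alpha$ is the latency and $\beta$ the bandwidth. Latency dominates small transfers; bandwidth dominates large ones.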
## Cache basics
- Programs usually have *locality*
  - *Spatial*: nearby items accessed consecutively
  - *Temporal*: a small "working set" reused repeatedly
- Cache hierarchy built to exploit locality
  - Cache = small, fast memory
  - Several types of cache on modern chips
## Caches help...
- Hide memory cost by reusing data
- Exploits temporal locality
- Use bandwidth to fetch *cache line* all at once
- Exploits spatial locality
- Use bandwidth to support multiple outstanding reads
- Overlap computation + communication with memory
- aka prefetching
This is (mostly) automatic and implicit.
## Cache organization
- Store cache *lines* of several bytes (64 B is typical)
- Cache *hit* = copy of needed data is in cache
- Cache *miss* otherwise. Three basic types:
  - Compulsory: data has never been used before
  - Capacity: working set too big, so the data was evicted
  - Conflict: insufficient *associativity* for the access pattern
- Cache *hit rate* = cache hits / memory accesses attempted
## Cache associativity
- Where can the data for a given main memory address go?
  - Direct-mapped: only one possible cache location
  - $n$-way set associative: $n$ possible cache locations
  - Fully associative: anywhere in the cache
- Ex: 8-bit address $10011101_2$
  - Cache location based on the low-order bits of the address
  - Direct mapped (16 entries): can only be stored in location $1101_2$
  - 4-way associative (64 entries, hence 16 sets): four possible locations in set $1101_2$
  - Either way, address $10111101_2$ maps to the same place and competes for it (see the sketch after this list)
- High associativity is more expensive
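To make the example concrete, here is a tiny sketch of the index computation (toy parameters from the example above, with one-byte lines; real caches index by line, not by byte):

```c
#include <stdio.h>

int main(void)
{
    unsigned a = 0x9D;  /* 10011101 in binary */
    unsigned b = 0xBD;  /* 10111101 in binary */
    /* Direct mapped, 16 entries: the low 4 bits pick the single slot,
       so these two addresses evict each other. */
    printf("direct-mapped slots: %u and %u\n", a & 0xF, b & 0xF);
    /* 4-way associative, 64 entries: 64/4 = 16 sets, so the low 4 bits
       again pick the set -- but each set holds 4 lines, so both
       addresses can live in the cache at once. */
    printf("4-way sets:          %u and %u\n", a & 0xF, b & 0xF);
    return 0;
}
```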
## Caches on my laptop
Multiple *levels* of cache with different sizes and latencies.
Cache lines are 64B in all cases, I think.
- Data caches:
  - L1 cache: 32 KB/core, 8-way (4 clocks)
  - L2 cache: 256 KB/core, 8-way (12 clocks)
  - L3 cache: 3 MB (shared), 12-way (21 clocks?)
- Also have instruction caches for code (less of a worry for us)
A miss in a smaller, faster cache may still hit in a larger, slower cache.
## Modeling question
Consider my machine (peak 100 GFlop/s, peak memory bandwidth 25.6 GB/s).
Suppose a workload of mostly double-precision fused multiply-adds.
What is the minimum cache hit rate needed to maintain half peak?
(We'll talk more about this in class)
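As a head start, a sketch of one possible setup (assumptions: one double-precision operand fetched per flop, hits are free, and each miss costs 8 bytes of memory bandwidth; the in-class model may differ). Half of peak is 50 GFlop/s, so with miss rate $m$,

$$m \cdot (50 \times 10^9\ \text{flop/s}) \cdot 8\ \text{B} \le 25.6\ \text{GB/s} \implies m \le 0.064,$$

i.e. a hit rate of at least about $93.6\%$.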
## Modeling question
We have $N = 10^6$ two-dimensional points and want their centroid.
Which of these is faster and why?
1. Store an array of $(x_i, y_i)$ coordinates. Loop $i$ and
simultaneously sum the $x_i$ and $y_i$.
2. Store an array of $(x_i, y_i)$ coordinates. Loop $i$ and
sum the $x_i$, then sum the $y_i$ in a separate loop.
3. Store the $x_i$ in one array, the $y_i$ in a second array.
Sum the $x_i$, then sum the $y_i$.
4. Other methods?
Try it out and see!
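If you want a starting point, here is a minimal sketch of the three layouts (function and type names are invented for illustration; you would add your own timing harness):

```c
#define N 1000000

typedef struct { double x, y; } point;

/* 1: array of structs, one fused loop over both coordinates */
void centroid_aos_fused(const point *p, double *cx, double *cy)
{
    double sx = 0, sy = 0;
    for (int i = 0; i < N; ++i) { sx += p[i].x; sy += p[i].y; }
    *cx = sx / N; *cy = sy / N;
}

/* 2: array of structs, two passes (each pass streams all 16 B per point) */
void centroid_aos_split(const point *p, double *cx, double *cy)
{
    double sx = 0, sy = 0;
    for (int i = 0; i < N; ++i) sx += p[i].x;
    for (int i = 0; i < N; ++i) sy += p[i].y;
    *cx = sx / N; *cy = sy / N;
}

/* 3: struct of arrays, two passes over contiguous doubles */
void centroid_soa(const double *x, const double *y, double *cx, double *cy)
{
    double sx = 0, sy = 0;
    for (int i = 0; i < N; ++i) sx += x[i];
    for (int i = 0; i < N; ++i) sy += y[i];
    *cx = sx / N; *cy = sy / N;
}
```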
## A memory benchmark (membench)
```
for array A of length L from 4 KB to 8 MB, doubling L:
    for stride s from 4 bytes to L/2, doubling s:
        time the following loop:
            for i = 0 to L by s
                load A[i]
```
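A minimal runnable version (a sketch: the repetition count, timer, and output format are my assumptions, not the official membench code):

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t max_bytes = 8 << 20;  /* 8 MB */
    const int reps = 100;              /* repeat for measurable times */
    char *A = malloc(max_bytes);
    for (size_t i = 0; i < max_bytes; ++i) A[i] = (char) i;  /* touch pages */
    volatile char sink = 0;  /* keep the loads from being optimized away */
    for (size_t L = 4 << 10; L <= max_bytes; L *= 2) {
        for (size_t s = 4; s <= L/2; s *= 2) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int r = 0; r < reps; ++r)
                for (size_t i = 0; i < L; i += s)
                    sink ^= A[i];
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
            /* print array size, stride, and ns per access */
            printf("%zu %zu %g\n", L, s, ns / ((double) reps * (L/s)));
        }
    }
    free(A);
    return 0;
}
```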
## Membench on my laptop
*(figure: membench timing results by array size and stride)*