# CS 5220: Applications of Parallel Computers
## Matmul and tiling
## 08 Sep 2015
## A memory benchmark (membench)
    for array A of length L from 4KB to 8MB by 2x
      for stride s from 4 bytes to L/2 by 2x
        time the following loop
        for i = 0 to L by s
          load A[i]
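In C, the timed inner loop might look like the sketch below. The timer choice (`omp_get_wtime`) and the `volatile` sink are my assumptions, not part of the original benchmark:

    #include <omp.h>

    /* Time one (L, s) pair: a stride-s sweep over an L-byte array.
       The volatile sink keeps the compiler from discarding the loads. */
    double time_stride(const char* A, long L, long s, int ntrials)
    {
        volatile char sink = 0;
        double t0 = omp_get_wtime();
        for (int trial = 0; trial < ntrials; ++trial)
            for (long i = 0; i < L; i += s)
                sink = A[i];
        return (omp_get_wtime() - t0) / ntrials;
    }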
## Membench in pictures
  - Size = 64 bytes (16 ints)
 
  - Strides of 4, 8, 16, 32 bytes
 
## Membench on Totient CPU
  - Vertical: 64B line size, 4K page size
 
  - Horizontal: 64KB L1, 256KB L2, 15MB L3
 
  - Diagonal: 8-way cache associativity, 512 entry L2 TLB
 
## Note on storage

  - Two standard layouts:

    - Column major (Fortran): A(i,j) at A+i+j*n

    - Row major (C): A(i,j) at A+i*n+j

  - I default to column major

  - Also note: C has poor language support for matrices
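As a concrete illustration (a sketch; `AA` and `n` stand for a flat array and the matrix dimension, as in the matmul code below):

    /* Indexing an n-by-n matrix stored in the flat array AA */
    #define A_COL(i,j) AA[(i) + (j)*n]   /* column major: columns contiguous */
    #define A_ROW(i,j) AA[(i)*n + (j)]   /* row major: rows contiguous */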
 
## Matrix multiply
How fast can naive matrix multiply run?
    #define A(i,j) AA[(i)+(j)*n]
    #define B(i,j) BB[(i)+(j)*n]
    #define C(i,j) CC[(i)+(j)*n]

    memset(CC, 0, n*n*sizeof(double)); /* clear CC; the bare name C would
                                          not expand the function-like macro */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C(i,j) += A(i,k) * B(k,j);
## One row in naive

  - Access $A$ and $C$ with stride $8n$ bytes
 
  - Access all $8n^2$ bytes of $B$ before first re-use
 
  - Poor arithmetic intensity
 
## Matrix multiply compared (Totient + ICC)
Hmm...

  - Compiler makes some difference
 
  - Naive Fortran is faster than naive C
 
  - Local instruction mix sets the "speed of light"

  - Access pattern determines how close we get to that limit
 
## Engineering strategy

  - Start with a small *kernel* multiply

    - Maybe odd sizes, strange layouts -- just go fast!

    - May play with AVX intrinsics, compiler flags, etc.

    - Deserves its own timing rig

  - Use blocking based on the kernel to improve the access pattern (see the sketch below)
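A minimal sketch of this two-level structure, assuming column-major storage, `n` divisible by `BLOCK`, and a naive placeholder where the tuned kernel would go:

    #define BLOCK 64  /* block size: a tuning parameter in practice */

    /* Placeholder BLOCK x BLOCK kernel: C += A*B, leading dimension lda.
       A tuned version might use AVX intrinsics, a copied layout, etc. */
    static void kernel_dgemm(int lda, const double* A, const double* B, double* C)
    {
        for (int j = 0; j < BLOCK; ++j)
            for (int k = 0; k < BLOCK; ++k)
                for (int i = 0; i < BLOCK; ++i)
                    C[i + j*lda] += A[i + k*lda] * B[k + j*lda];
    }

    /* Blocked multiply: each cached block is reused ~BLOCK times
       before eviction, instead of once as in the naive loop. */
    void blocked_dgemm(int n, const double* A, const double* B, double* C)
    {
        for (int bj = 0; bj < n; bj += BLOCK)
            for (int bk = 0; bk < n; bk += BLOCK)
                for (int bi = 0; bi < n; bi += BLOCK)
                    kernel_dgemm(n, A + bi + bk*n, B + bk + bj*n, C + bi + bj*n);
    }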
 
## Simple model
- Two types of memory (fast+slow)
  - $m$ = words read from slow memory
  - $t_m$ = slow memory op time
  - $f$ = number of flops
  - $t_f$ = time per flop
  - $q = f/m$ = average flops/slow access
- Time:
  $$ft_f + mt_m = ft_f \left( 1 + \frac{t_m/t_f}{q} \right)$$
- Larger $q$ means time closer to the flop-limited ideal $f t_f$
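For example, for the naive multiply above, count every access to $B$ as a slow read and assume a row of $A$ and an entry of $C$ stay in fast memory (a rough model, ignoring lower-order terms):
$$f = 2n^3, \qquad m \approx n^3, \qquad q = \frac{f}{m} \approx 2$$
So $q$ stays near 2 no matter how large $n$ is; blocking is how we raise it.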
## How big can $q$ be?
Level 1/2/3 Basic Linear Algebra Subroutines (BLAS)
1.  Dot product: $n$ data, $2n$ flops
2.  Matrix-vector multiply: $n^2$ data, $2n^2$ flops
3.  Matrix-matrix multiply: $2n^2$ data, $2n^3$ flops
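Dividing flops by data in each case:
$$q_1 = \frac{2n}{n} = 2, \qquad q_2 = \frac{2n^2}{n^2} = 2, \qquad q_3 = \frac{2n^3}{2n^2} = n$$
Only at level 3 does $q$ grow with $n$, so only level 3 operations can hope to run near the speed of the floating-point units.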
We like to build on level 3 BLAS (like matrix multiplication)!
## Tuning matrix multiply
  
- [Matmul assignment is up](https://github.com/cornell-cs5220-f15/matmul)
- You will get email with group assignments
- Goal is single core performance *analysis* and *tuning*
- Deliverables
  - Report describing strategy and performance results
  - Pointer to a repository so we can run a competition
## Possible tactics
- Manually tune some small kernels
- Write an auto-tuner to sweep parameters
- Try different compilers (and flags)
- Try different layouts
- Copy optimization (see the sketch after this list)
- Study strategies from past/present classes!
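A minimal sketch of the copy-optimization idea, assuming column-major storage and the `BLOCK` blocking from earlier (not the assignment's code):

    #include <string.h>

    #define BLOCK 64

    /* Gather a BLOCK x BLOCK block of A (leading dimension n, column major)
       into a contiguous buffer so the kernel reads it with unit stride,
       touching fewer cache lines and TLB entries. */
    static void copy_block(int n, const double* A, double buf[BLOCK*BLOCK])
    {
        for (int j = 0; j < BLOCK; ++j)
            memcpy(buf + j*BLOCK, A + j*n, BLOCK * sizeof(double));
    }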
## Warning
  
- Tuning can be like video games!
- Do spend the time to do a good job
- Don't get so sucked in you neglect more important things