# CS 5220

## Shared memory: OpenMP

24 Sep 2015
### Shared memory programming model

Program consists of *threads* of control.

- Can be created dynamically
- Each has private variables (e.g. locals)
- Each has shared variables (e.g. heap)
- Communication through shared variables
- Coordinate by synchronizing on variables
- Examples: *OpenMP*, pthreads, Cilk, Java threads
### The problem with pthreads revisited

- pthreads can be painful!
  - Makes code verbose
  - Synchronization is hard to think about
- Would like to make this more automatic!
  - ... and people have been trying for a couple of decades
- OpenMP gets us *part* of the way
### OpenMP: Open spec for MultiProcessing

- Standard API for multi-threaded code
  - Only a spec; multiple implementations
  - Lightweight syntax
  - C or Fortran (with appropriate compiler support)
- High level:
  - Preprocessor/compiler directives (80%)
  - Library calls (19%)
  - Environment variables (1%)

### Compiling OpenMP

A practical aside...

- OpenMP is supported by the Intel and GCC compilers
- Not in the main Clang release
- I use GCC from Homebrew for OpenMP on OS X
- GCC: need `-fopenmp` on both compile and link lines

      gcc -fopenmp -c foo.c
      gcc -fopenmp -o mycode.x foo.o

- Intel: need `-openmp` on both compile and link lines

      icc -openmp -c foo.c
      icc -openmp -o mycode.x foo.o
### Parallel “hello world”

    #include <stdio.h>
    #include <omp.h>

    int main()
    {
        #pragma omp parallel
        printf("Hello world from %d\n", omp_get_thread_num());
        return 0;
    }
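To try it, a sketch following the GCC flags above (`hello.c` and `hello.x` are my file names, not the slides'); the standard `OMP_NUM_THREADS` environment variable sets the team size:

    gcc -fopenmp -o hello.x hello.c
    OMP_NUM_THREADS=4 ./hello.x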

### Parallel sections

- Basic model: fork-join
- Each thread runs the same code block
- Annotations distinguish shared ($s$) and private ($i$) data
- Relaxed consistency for shared data

### Parallel sections

    #include <omp.h>
    #define MAX_THREADS 64  /* assumed bound; not in the original slide */

    double s[MAX_THREADS];
    int i;
    #pragma omp parallel shared(s) private(i)
    {
        i = omp_get_thread_num();
        s[i] = i;  /* each thread writes its own entry */
    }
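One caveat that the last slide of this deck picks up again: different threads write adjacent entries of `s`, which sit on the same cache line, so this pattern invites false sharing. A minimal padding sketch, reusing `MAX_THREADS` from the snippet above and assuming 64-byte cache lines (my assumption, not from the slides):

    /* Pad each thread's slot out to a full (assumed) 64-byte cache line
       so writes from different threads never touch the same line. */
    typedef struct { double val; char pad[64 - sizeof(double)]; } slot_t;

    slot_t s2[MAX_THREADS];
    #pragma omp parallel
    {
        int i = omp_get_thread_num();
        s2[i].val = i;  /* one cache line per thread: no false sharing */
    }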

### Critical sections

- Automatically lock/unlock at the ends of the critical section
- Automatic memory flushes for consistency
- Locks are still there if you really need them (see the sketch below)...

### Critical sections

    #pragma omp parallel
    {
        //...
        #pragma omp critical(my_data_cs)
        {
            //... modify data structure here ...
        }
    }
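A minimal sketch of the explicit lock API alluded to above (the thread-counting example is my own illustration, not from the slides):

    #include <omp.h>

    /* Count the threads in a team, one locked update per thread. */
    int count_threads(void)
    {
        omp_lock_t lock;
        int count = 0;
        omp_init_lock(&lock);
        #pragma omp parallel
        {
            omp_set_lock(&lock);     /* acquire, like entering a critical section */
            ++count;                 /* ... modify shared data here ... */
            omp_unset_lock(&lock);   /* release */
        }
        omp_destroy_lock(&lock);
        return count;
    }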

### Barriers

Every thread executes all `nsteps` iterations; the barrier makes each step complete on all threads before any thread starts the next. Declaring `i` inside the loop keeps the counter private to each thread.

    #pragma omp parallel
    for (int i = 0; i < nsteps; ++i) {
        do_stuff();
        #pragma omp barrier
    }

### Parallel loops

- Independent loop body? At least the order doesn't matter.
- Partition the index space among threads
- Implicit barrier at end (except with `nowait`)
### Parallel loops

    /* Compute dot product of x and y of length n */
    int i;
    double my_dot, dot = 0;
    #pragma omp parallel \
            shared(dot,x,y,n) private(i,my_dot)
    {
        my_dot = 0;
        #pragma omp for
        for (i = 0; i < n; ++i)
            my_dot += x[i]*y[i];
        #pragma omp critical
        dot += my_dot;
    }
### Parallel loops

    /* Compute dot product of x and y of length n */
    int i;
    double dot = 0;
    #pragma omp parallel \
            shared(x,y,n) private(i) reduction(+:dot)
    {
        #pragma omp for
        for (i = 0; i < n; ++i)
            dot += x[i]*y[i];
    }
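For a simple loop like this, the `parallel` and `for` directives (plus the reduction) are usually fused into one combined directive; a sketch:

    double dot = 0;
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < n; ++i)
        dot += x[i]*y[i];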
### Parallel loop scheduling

Partition the index space in different ways:

- `static[(chunk)]`: decide at start of loop; default chunk is `n/nthreads`. Low overhead, potential load imbalance.
- `dynamic[(chunk)]`: each thread takes `chunk` iterations when it has time; default `chunk` is 1. Higher overhead, but automatically balances load.
- `guided`: take chunks of size (unassigned iterations)/(threads); chunks get smaller toward the end of the loop. Somewhere between `static` and `dynamic`.
- `auto`: up to the system! Default behavior is implementation-dependent.
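The schedule is chosen with a clause on the loop directive; a sketch, reusing `do_stuff` as a stand-in for an iteration of uneven cost:

    /* Hand out iterations four at a time, to whichever thread is free */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; ++i)
        do_stuff();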
### Other parallel work divisions

- `single`: do only in one thread (e.g. I/O); others wait at an implied barrier
- `master`: do only in the master thread; others skip it (no implied barrier)
- `sections`: like cobegin/coend (see the sketch below)
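A minimal `sections` sketch (`do_task_a` and `do_task_b` are hypothetical placeholders):

    #pragma omp parallel sections
    {
        #pragma omp section
        do_task_a();   /* runs on one thread */
        #pragma omp section
        do_task_b();   /* may run concurrently on another */
    }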
### Tasks

- So far, very static flavors of parallelism
- *Tasks* allow more dynamic parallel patterns
- From OpenMP 3.0 on, [explicit tasking support](http://openmp.org/sc13/sc13.tasking.ruud.pdf)
### Tasks

    #pragma omp parallel
    {
        #pragma omp single
        {
            // General setup work
            #pragma omp task
            task1();
            #pragma omp task
            task2();
            #pragma omp taskwait
            depends_on_both_tasks();
        }
    }
### Linked list

Adapted from [an SC13 presentation](http://openmp.org/sc13/sc13.tasking.ruud.pdf):

    node_t* p = head;
    #pragma omp parallel
    {
        #pragma omp single nowait
        while (p != NULL) {
            #pragma omp task firstprivate(p)
            do_work(p);
            p = p->next;
        }
    }
    // Implied barrier at end of parallel region
### [Post-order traversal](http://openmp.org/wp/presos/sc07openmpbof.pdf)

    void traverse(node_t* p)
    {
        if (p->left) {
            #pragma omp task
            traverse(p->left);
        }
        if (p->right) {
            #pragma omp task
            traverse(p->right);
        }
        #pragma omp taskwait  /* wait for the child tasks */
        process(p->data);
    }

### Essential complexity?

Fred Brooks (*The Mythical Man-Month*) identified two types of software complexity: essential and accidental.

Does OpenMP address accidental complexity? Yes, somewhat!

Essential complexity is harder.

### Things to still think about with OpenMP

- Proper serial performance tuning?
- Minimizing false sharing?
- Minimizing synchronization overhead?
- Minimizing loop scheduling overhead?
- Load balancing?
- Finding enough parallelism in the first place?

Let's focus again on memory issues...