# CS 5220: Applications of Parallel Computers

## Parallel machines and models

## 10 Sep 2015
## Why clusters?

- Clusters of SMPs are everywhere
- Commodity hardware – economics!
- Supercomputer = cluster + custom interconnect
- Relatively simple to set up and administer (?)
- But still costs room, power, ...
- Economy of scale $\implies$ clouds?
- Amazon now has HPC instances on EC2
- StarCluster: launch your own EC2 cluster
- Lots of interesting challenges here
## Totient structure

Consider:

- Each core has vector parallelism
- Each chip has six cores, sharing memory with the others
- Each accelerator has sixty cores with shared memory
- Each box has two chips + accelerators
- Eight instructional nodes communicate via Ethernet

How did we get here? Why this type of structure? And how does the
programming model match the hardware?
## Parallel computer hardware

- Physical machine has *processors*, *memory*, and an *interconnect*
  - Where is the memory physically?
  - Is it attached to processors?
  - What is the network connectivity?
- Programming *model* is expressed through languages and libraries
  - Programming model $\neq$ hardware organization!
  - Example: can run MPI on a shared memory node!
## Parallel programming model

- Control
  - How is parallelism created?
  - What ordering is there between operations?
- Data
  - What data is private, and what is shared?
  - How is data logically shared or communicated?
- Synchronization
  - What operations are used to coordinate?
  - What operations are atomic?
- Cost: how do we reason about each of the above?
## Simple example

Consider the dot product of $x$ and $y$:

- Where do the arrays $x$ and $y$ live? One CPU? Partitioned?
- Who does what work?
- How do we combine partial results into a single final result?
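For reference, a minimal serial dot product in C (the function name and signature are our own choice for illustration); the parallel versions that follow all start from this loop.

```c
/* Serial baseline: all data lives on one CPU, and one thread of control
 * does all the work and produces the single final result. */
double dot(int n, const double *x, const double *y)
{
    double s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
```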
## Shared memory programming model

Program consists of *threads* of control.

- Threads can be created dynamically
- Each has private variables (e.g. locals)
- Each has shared variables (e.g. heap)
- Communication through shared variables
- Coordination by synchronizing on variables
- Examples: OpenMP, pthreads
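A minimal OpenMP illustration of this model (our own toy example, not from the slides): variables declared outside the parallel region are shared, variables declared inside are private, and threads communicate and coordinate through the shared ones.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int checked_in = 0;                  /* declared outside: shared by all threads */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();  /* declared inside: private to each thread */
        #pragma omp atomic
        checked_in++;                    /* communicate through a shared variable   */
        printf("Hello from thread %d\n", tid);
    }
    printf("%d threads checked in\n", checked_in);
    return 0;
}
```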
## Shared memory dot product

Dot product of two $n$-vectors on $p \ll n$ processors:

1. Each CPU evaluates a partial sum ($n/p$ elements, local)
2. Everyone tallies the partial sums

Can we go home now?
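A naive OpenMP sketch of this recipe (again our own illustration): each thread computes a local partial sum, then adds it into a shared accumulator with no synchronization. As the next slides show, that last step is exactly the problem.

```c
#include <omp.h>

/* Naive shared-memory dot product: the unsynchronized update of s is a bug. */
double dot(int n, const double *x, const double *y)
{
    double s = 0;
    #pragma omp parallel
    {
        double partial_sum = 0;              /* private to each thread */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            partial_sum += x[i] * y[i];
        s += partial_sum;                    /* race: concurrent read-modify-write */
    }
    return s;
}
```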
## Race condition

A *race condition*:

- Two threads access the same variable, and at least one access is a write
- Accesses are concurrent – no ordering guarantees
- Could happen simultaneously!

Need synchronization via a lock or barrier.
## Race to the dot

Consider `S += partial_sum` on two CPUs:

- P1: Load `S`
- P1: Add `partial_sum`
- P2: Load `S`
- P1: Store new `S`
- P2: Add `partial_sum`
- P2: Store new `S`

In this interleaving, P2 loads the old `S`, so its store overwrites P1's
update and P1's partial sum is lost.
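The point is that the single statement `S += partial_sum` is not atomic. Schematically (this is an illustration, not actual compiler output), each CPU executes something like the following, and the three steps from the two CPUs can interleave as above.

```c
/* Schematic: `S += partial_sum` is really three separate steps, and another
 * CPU's load/add/store can slip in between them. */
void update(double *S, double partial_sum)
{
    double tmp = *S;          /* load the shared value                       */
    tmp = tmp + partial_sum;  /* add the private partial sum                 */
    *S = tmp;                 /* store: may overwrite the other CPU's update */
}
```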
## Shared memory dot with locks

Solution: consider `S += partial_sum` a *critical section*

- Only one CPU at a time allowed in the critical section
- Can violate invariants locally
- Enforce via a lock or mutex (mutual exclusion variable)

Dot product with mutex:

1. Create global mutex `l`
2. Compute `partial_sum`
3. Lock `l`
4. `S += partial_sum`
5. Unlock `l`
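One way to realize this recipe with OpenMP locks (a sketch under the same assumptions as the earlier version; `#pragma omp critical` or a `reduction` clause would also do the job):

```c
#include <omp.h>

/* Locked shared-memory dot product: only one thread updates s at a time. */
double dot(int n, const double *x, const double *y)
{
    double s = 0;
    omp_lock_t l;
    omp_init_lock(&l);                       /* create global mutex l */
    #pragma omp parallel
    {
        double partial_sum = 0;
        #pragma omp for
        for (int i = 0; i < n; ++i)
            partial_sum += x[i] * y[i];      /* compute partial_sum   */
        omp_set_lock(&l);                    /* lock l                */
        s += partial_sum;                    /* S += partial_sum      */
        omp_unset_lock(&l);                  /* unlock l              */
    }
    omp_destroy_lock(&l);
    return s;
}
```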
## Shared memory with barriers

- Many codes have phases (e.g. time steps)
- Communication is only needed at the end of a phase
- Idea: synchronize at the end of each phase with a *barrier*
- More restrictive (less efficient?) than fine-grained locks
- Easier to think through! (e.g. less chance of deadlock)
- Sometimes called *bulk synchronous programming*
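A bulk-synchronous sketch in OpenMP (a toy 1D relaxation of our own invention): each `omp for` ends in an implicit barrier, so every thread finishes one phase before any thread starts the next.

```c
#include <omp.h>

#define N 1024

/* Toy bulk-synchronous time stepper: each step averages neighbors in a
 * shared array, with barriers separating the compute and update phases. */
void relax(double u[N], double u_new[N], int nsteps)
{
    #pragma omp parallel
    {
        for (int step = 0; step < nsteps; ++step) {
            #pragma omp for
            for (int i = 1; i < N-1; ++i)
                u_new[i] = 0.5 * (u[i-1] + u[i+1]);  /* phase 1: compute */
            /* implicit barrier at the end of the omp for loop */
            #pragma omp for
            for (int i = 1; i < N-1; ++i)
                u[i] = u_new[i];                     /* phase 2: update  */
            /* implicit barrier again before the next time step */
        }
    }
}
```

Where no implicit barrier is available, an explicit `#pragma omp barrier` marks the end of a phase in the same way.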
## Shared memory machine model

- Processors and memories talk through a bus
- Symmetric Multiprocessor (SMP)
- Hard to scale to lots of processors (think $\leq 32$)
  - Bus becomes a bottleneck
  - *Cache coherence* is a pain
- Example: 6-core chips on the cluster
## Multithreaded processor machine

- May have more threads than processors!
- Can switch threads on long-latency operations
  - Cray MTA was an extreme example
- Similar to *hyperthreading*
  - But hyperthreading doesn’t switch – it just schedules multiple
    threads onto the same CPU functional units
## Distributed shared memory

- Non-Uniform Memory Access (NUMA)
- Can *logically* share memory while *physically* distributing it
- Any processor can access any address
- Cache coherence is still a pain
- Example: SGI Origin (or the multiprocessor nodes on the cluster)
## Message-passing programming model

- Collection of named processes
- Data is *partitioned*
- Communication by send/receive of explicit messages
- Lingua franca: MPI (Message Passing Interface)
## Message passing dot product: v1

| Processor 1                | Processor 2                |
|----------------------------|----------------------------|
| 1. Partial sum `s1`        | 1. Partial sum `s2`        |
| 2. Send `s1` to P2         | 2. Send `s2` to P1         |
| 3. Receive `s2` from P2    | 3. Receive `s1` from P1    |
| 4. `s = s1 + s2`           | 4. `s = s1 + s2`           |

What could go wrong? Think of phones vs letters...
## Message passing dot product: v2

| Processor 1                | Processor 2                |
|----------------------------|----------------------------|
| 1. Partial sum `s1`        | 1. Partial sum `s2`        |
| 2. Send `s1` to P2         | 2. Receive `s1` from P1    |
| 3. Receive `s2` from P2    | 3. Send `s2` to P1         |
| 4. `s = s1 + s2`           | 4. `s = s1 + s2`           |

Better, but what if there are more than two processors?
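A sketch of v2 in MPI for exactly two ranks (our own code, not from the slides): rank 0 sends first and rank 1 receives first, so the exchange cannot deadlock.

```c
#include <mpi.h>

/* Two-rank dot product in the style of v2. Assumes MPI_Init has been
 * called and the communicator has exactly two ranks. */
double dot2(int nlocal, const double *x, const double *y)
{
    int rank;
    double s_local = 0, s_other = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < nlocal; ++i)
        s_local += x[i] * y[i];                 /* partial sum */
    if (rank == 0) {
        MPI_Send(&s_local, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&s_other, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&s_other, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&s_local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    return s_local + s_other;                   /* s = s1 + s2 */
}
```

With $p$ ranks, the usual approach is a collective: `MPI_Allreduce(&s_local, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD)` combines all the partial sums and gives every rank the result.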
## MPI: the de facto standard

- Pro: *portability*
- Con: least-common-denominator for the mid-80s

The “assembly language” (or C?) of parallelism...
but, alas, assembly language can be high performance.
## Distributed memory machines

- Each node has local memory
  - ... and no direct access to memory on other nodes
- Nodes communicate via a network interface
- Example: our cluster!
- Other examples: IBM SP, Cray T3E
## The story so far

- *Serial* performance as groundwork
  - Complicated function of architecture and memory
  - Understand it in order to design data layouts and algorithms
- *Parallel* performance
  - Serial issues + communication/synchronization overheads
  - Limit: available parallel work (Amdahl)
- Also discussed serial architecture and some basics of parallel machine
  models and programming models
- Next: parallelism and locality in simulations