# CS 5220

## Distributed memory

### MPI

## 08 Oct 2015
### Previously on Parallel Programming

Can write a lot of MPI code with 6 operations we've seen:

- `MPI_Init`
- `MPI_Finalize`
- `MPI_Comm_size`
- `MPI_Comm_rank`
- `MPI_Send`
- `MPI_Recv`

... but there are sometimes better ways. Decide on communication style using simple performance models.
### Reminder: basic send and recv

```c
MPI_Send(buf, count, datatype, dest, tag, comm);
MPI_Recv(buf, count, datatype, source, tag, comm, status);
```

`MPI_Send` and `MPI_Recv` are *blocking*

- Send does not return until data is in system
- Recv does not return until data is ready
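For concreteness, a minimal two-rank sketch using only these calls (not from the slides; it assumes exactly two processes, and the payload value is made up):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        x = 42;  /* made-up payload */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", x);
    }
    MPI_Finalize();
    return 0;
}
```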

### Blocking and buffering

Block until data is in system (maybe in a buffer?).

### Blocking and buffering

Alternative: don't copy, block until done.

### Problem 1: Potential deadlock

Both processors wait to finish their send before they can receive! The deadlock may not happen if there is plenty of buffering on both sides.
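A minimal sketch of the hazard (not from the slides; the two-rank setup and message size are assumptions): both ranks post a large blocking send first, so neither reaches its receive.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    const int N = 1<<20;  /* assumed large enough to exceed internal buffering */
    double* sendbuf = calloc(N, sizeof(double));
    double* recvbuf = calloc(N, sizeof(double));
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int partner = 1-rank;  /* run with exactly two ranks */

    /* Both ranks send first, then receive.  For large messages,
       MPI_Send typically cannot complete until a matching receive
       is posted -- which never happens, so the program hangs. */
    MPI_Send(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    MPI_Finalize();
    free(sendbuf);
    free(recvbuf);
    return 0;
}
```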

### Solution 1: Alternating order

Could alternate who sends and who receives.
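One way to code the alternation, as a hedged sketch (the function, buffer, and partner names are made up for illustration): order the calls by rank parity so one side of each pair sends while the other receives.

```c
#include <mpi.h>

/* Pairwise exchange without deadlock: even ranks send first, odd
   ranks receive first.  Assumes partners have opposite parity
   (e.g. neighbors in a 1D chain).  Illustrative sketch only. */
void exchange(double* sendbuf, double* recvbuf, int n,
              int partner, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank % 2 == 0) {
        MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, comm);
        MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, comm);
    }
}
```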

### Solution 2: Combined send/recv

Common operations deserve explicit support!

### Combined sendrecv

```c
MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag,
             recvbuf, recvcount, recvtype, source, recvtag,
             comm, status);
```

Blocking operation, combines send and recv to avoid deadlock.
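A hedged sketch of a ring shift with `MPI_Sendrecv` (the ring topology and variable names are assumptions, not from the slides): each rank sends to its right neighbor and receives from its left neighbor in a single call, with no ordering gymnastics.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, size, token, received;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;         /* destination neighbor */
    int left  = (rank + size - 1) % size;  /* source neighbor */
    token = rank;                          /* made-up payload */
    MPI_Sendrecv(&token, 1, MPI_INT, right, 0,
                 &received, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Rank %d got %d from rank %d\n", rank, received, left);
    MPI_Finalize();
    return 0;
}
```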

### Problem 2: Communication overhead

Partial solution: nonblocking communication

### Blocking vs non-blocking communication

- `MPI_Send` and `MPI_Recv` are *blocking*
  - Send does not return until data is in system
  - Recv does not return until data is ready
  - Cons: possible deadlock, time wasted waiting
- Why blocking?
  - Overwrite buffer during send $\implies$ evil!
  - Read buffer before data ready $\implies$ evil!
- Alternative: *nonblocking* communication
  - Split into distinct initiation/completion phases
  - Initiate send/recv and promise not to touch buffer
  - Check later for operation completion

### Overlap communication and computation
### Nonblocking operations

Initiate message:

```c
MPI_Isend(start, count, datatype, dest, tag, comm, request);
MPI_Irecv(start, count, datatype, source, tag, comm, request);
```

Wait for message completion:

```c
MPI_Wait(request, status);
```

Test for message completion:

```c
MPI_Test(request, flag, status);
```
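A hedged sketch of the initiate/complete pattern applied to a pairwise exchange (the function name, arguments, and placeholder computation are not from the slides): post both operations, do work that touches neither buffer, then wait.

```c
#include <mpi.h>

/* Post the receive and send, compute on data unrelated to the
   message buffers, then wait for both requests to complete. */
void exchange_overlap(double* sendbuf, double* recvbuf, int n,
                      int partner, MPI_Comm comm)
{
    MPI_Request recv_req, send_req;
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, &recv_req);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, comm, &send_req);

    /* ... computation that does not read recvbuf or write sendbuf ... */

    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
}
```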
### Multiple outstanding requests

Sometimes useful to have multiple outstanding messages:

```c
MPI_Waitall(count, requests, statuses);
MPI_Waitany(count, requests, index, status);
MPI_Waitsome(count, requests, indices, statuses);
```

Multiple versions of test as well.
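For instance, a hedged sketch of exchanging with several neighbors at once (the neighbor list and per-neighbor buffers are placeholders): post all the nonblocking operations, then complete them with a single `MPI_Waitall`.

```c
#include <stdlib.h>
#include <mpi.h>

/* Post nonblocking receives and sends to each neighbor, then wait
   for all 2*nnbrs requests together.  Illustrative sketch only. */
void exchange_all(double** sendbufs, double** recvbufs, int n,
                  int* nbrs, int nnbrs, MPI_Comm comm)
{
    MPI_Request* reqs = malloc(2 * nnbrs * sizeof(MPI_Request));
    for (int i = 0; i < nnbrs; ++i) {
        MPI_Irecv(recvbufs[i], n, MPI_DOUBLE, nbrs[i], 0, comm, &reqs[i]);
        MPI_Isend(sendbufs[i], n, MPI_DOUBLE, nbrs[i], 0, comm, &reqs[nnbrs+i]);
    }
    MPI_Waitall(2 * nnbrs, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```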
### Other send/recv variants

Other variants of `MPI_Send`:

- `MPI_Ssend` (synchronous) – complete after receive begun
- `MPI_Bsend` (buffered) – user provides buffer via `MPI_Buffer_attach`
- `MPI_Rsend` (ready) – must have receive already posted
- Can combine modes (e.g. `MPI_Issend`)

`MPI_Recv` receives anything.
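A sketch of the buffered mode under stated assumptions (the function name and the idea of attaching/detaching around a single send are illustrative, not from the slides): attach user buffer space, let `MPI_Bsend` copy into it and return, then detach.

```c
#include <stdlib.h>
#include <mpi.h>

/* Buffered send: MPI copies the message into user-attached space and
   MPI_Bsend returns immediately; MPI_Buffer_detach blocks until the
   buffered data has been delivered. */
void buffered_send(double* msg, int n, int dest, MPI_Comm comm)
{
    int packsize, bufsize;
    MPI_Pack_size(n, MPI_DOUBLE, comm, &packsize);
    bufsize = packsize + MPI_BSEND_OVERHEAD;

    char* buffer = malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);

    MPI_Bsend(msg, n, MPI_DOUBLE, dest, 0, comm);  /* returns after copy */

    MPI_Buffer_detach(&buffer, &bufsize);
    free(buffer);
}
```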
### Another approach

- Send/recv is one-to-one communication
- An alternative is one-to-many (and vice-versa):
  - *Broadcast* to distribute data from one process
  - *Reduce* to combine data from all processors
- Operations are called by all processes in communicator
### Broadcast and reduce

```c
MPI_Bcast(buffer, count, datatype, root, comm);
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);
```

- `buffer` is copied from root to others
- `recvbuf` receives result only at root
- `op` is `MPI_MAX`, `MPI_SUM`, etc.
### Example: basic Monte Carlo

```c
#include <stdio.h>
#include <mpi.h>

void run_mc(int myid, int nproc, int ntrials);  /* defined below */

int main(int argc, char** argv)
{
    int nproc, myid, ntrials;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        printf("Trials per CPU:\n");
        scanf("%d", &ntrials);
    }
    MPI_Bcast(&ntrials, 1, MPI_INT, 0, MPI_COMM_WORLD);
    run_mc(myid, nproc, ntrials);
    MPI_Finalize();
    return 0;
}
```
### Example: basic Monte Carlo

Let `sums[0]` $= \sum_i X_i$ and `sums[1]` $= \sum_i X_i^2$ (accumulated locally in `my_sums`, combined by the reduction).

```c
#include <math.h>  /* for sqrt */

void run_mc(int myid, int nproc, int ntrials)
{
    double sums[2] = {0,0};
    double my_sums[2] = {0,0};
    /* ... run ntrials local experiments, accumulating into my_sums ... */
    MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (myid == 0) {
        int N = nproc*ntrials;
        double EX  = sums[0]/N;
        double EX2 = sums[1]/N;
        printf("Mean: %g; err: %g\n", EX, sqrt((EX2-EX*EX)/N));
    }
}
```
### Collective operations

- Involve all processes in communicator
- Basic classes:
  - Synchronization (e.g. barrier)
  - Data movement (e.g. broadcast)
  - Computation (e.g. reduce)
### Barrier

```c
MPI_Barrier(comm);
```

Not much more to say. Not needed that often.
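One place it does show up is timing. A hedged sketch (the `do_work` callback is a placeholder, not from the slides): bracket the measured region with barriers so all ranks start together.

```c
#include <mpi.h>

/* Synchronize before and after a timed region so the measured
   interval is not skewed by ranks arriving at different times. */
double time_work(void (*do_work)(void), MPI_Comm comm)
{
    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    do_work();
    MPI_Barrier(comm);
    return MPI_Wtime() - t0;
}
```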

### Broadcast

### Scatter/gather

### Allgather

### Alltoall

### Reduce

### Scan
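The calls behind these data-movement patterns follow the same shape as `MPI_Bcast`/`MPI_Reduce`; their standard signatures, written in the same parameter-name style as above (counts are per process):

```c
MPI_Scatter(sendbuf, sendcount, sendtype,
            recvbuf, recvcount, recvtype, root, comm);
MPI_Gather(sendbuf, sendcount, sendtype,
           recvbuf, recvcount, recvtype, root, comm);
MPI_Allgather(sendbuf, sendcount, sendtype,
              recvbuf, recvcount, recvtype, comm);
MPI_Alltoall(sendbuf, sendcount, sendtype,
             recvbuf, recvcount, recvtype, comm);
MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm);
```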


### The kitchen sink

- In addition to the above, there are vector variants (`v` suffix), more `All` variants (`Allreduce`), `Reduce_scatter`, ...
- MPI-3 adds one-sided communication (put/get)
- MPI is *not* a small library!
- But a small number of calls goes a long way:
  - `Init`/`Finalize`
  - `Comm_rank`, `Comm_size`
  - `Send`/`Recv` variants and `Wait`
  - `Allreduce`, `Allgather`, `Bcast`
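As an example of how far that small set goes, a hedged variant of the earlier `run_mc` reduction using `MPI_Allreduce` (the function name `report_all` is made up): every rank gets the global sums directly, with no separate broadcast of the result.

```c
#include <stdio.h>
#include <math.h>
#include <mpi.h>

/* Only the reduction call differs from the slide's run_mc: after
   MPI_Allreduce, every rank can compute and report the statistics. */
void report_all(double my_sums[2], int nproc, int ntrials)
{
    double sums[2];
    MPI_Allreduce(my_sums, sums, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    int N = nproc*ntrials;
    double EX  = sums[0]/N;
    double EX2 = sums[1]/N;
    printf("Mean: %g; err: %g\n", EX, sqrt((EX2-EX*EX)/N));
}
```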