I wrote this article to document my approach for writing parallel code in BSPonMPI v0.4. The idea is to write a single piece of C++ code which can scale from running on your desktop/laptop up to running on an HPC cluster.
I will write more detailed documentation for each of the above soon; this post is a bit of a preview.
BSP gives us a framework for answering two essential questions about a parallel algorithm:
Many people have made slides about BSP. Here are mine.
When thinking about a BSP computation, we alternate between computation and communication phases. During the computation phase, a number of processors work independently, and (possibly) in parallel. In the communication phase, they exchange their results as necessary to either get their inputs for the next computation phase, or to store their output.
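The alternation of phases can be illustrated with a minimal sketch in plain C++. It does not use BSPonMPI. The Message struct and the superstep function are hypothetical names invented for this illustration, and the logical processors are simulated by a sequential loop rather than by real threads or MPI ranks.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical message type for this sketch: deliver `value` to processor `dest`.
struct Message {
    std::size_t dest;
    long value;
};

// Simulates one BSP superstep for p = inbox.size() logical processors.
// inbox[i] holds the values processor i received in the previous superstep.
// Returns the inboxes for the next superstep.
std::vector<std::vector<long>> superstep(const std::vector<std::vector<long>>& inbox) {
    const std::size_t p = inbox.size();
    std::vector<Message> outgoing;

    // Computation phase: every processor works on its own data only.
    // The iterations are independent, so they could run in parallel.
    for (std::size_t pid = 0; pid < p; ++pid) {
        long local_sum = std::accumulate(inbox[pid].begin(), inbox[pid].end(), 0L);
        // Communication is only requested here, not carried out yet.
        outgoing.push_back({(pid + 1) % p, local_sum});
    }

    // Communication phase: all queued messages are delivered at once.
    std::vector<std::vector<long>> next(p);
    for (const Message& m : outgoing) {
        next[m.dest].push_back(m.value);
    }
    return next;
}
```

Calling superstep repeatedly on its own result chains several supersteps together. No processor ever sees another processor's data in the middle of a computation phase.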
Most of the libraries shown above implement the BSPlib standard, which was created in the 1990s as a simple but generic interface for parallel programming. However, MPI went on to become the most popular method for parallel programming on clusters. For fine-grained/SMP-style parallel programming, OpenMP and the Threading Building Blocks (TBB) are among the most successful high-level tools.
BSPonMPI becomes useful if you have some fast sequential code and are looking for a quick way to get a parallel version that runs on desktops and laptops (for small problem instances) and also scales to larger cluster systems.
Our first program will not do much: each processor will say hello and write a value (its ID) to a global memory array. Then, each processor will read a value from this array and output it:
The whole file can be found here; the following is a short walkthrough of what it does.
Two essential concepts introduced in v0.4 are contexts and runners.
A context includes everything a processor will need to run part of a parallel computation. In the BSP model, each context object stores the data for exactly one BSP processor’s local memory.
A runner performs the task of assigning contexts to physical processors. It’s nice to decouple logical and physical processors: it lets us start thinking about things like overpartitioning (making subproblems smaller), fault tolerance (we could make a copy of a context and map it many times), and so on.
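To make the decoupling concrete, here is a small sketch of one possible mapping policy. It is not how BSPonMPI's runners are implemented. The function assign_contexts is a hypothetical helper that distributes logical context IDs over physical workers in round-robin order, which is one simple way to overpartition.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of BSPonMPI): assigns n_logical context IDs
// to n_physical workers in round-robin order.
// plan[w] lists the contexts that worker w would execute.
std::vector<std::vector<std::size_t>> assign_contexts(std::size_t n_logical,
                                                      std::size_t n_physical) {
    std::vector<std::vector<std::size_t>> plan(n_physical);
    if (n_physical == 0) {
        return plan;  // nothing to map onto
    }
    for (std::size_t ctx = 0; ctx < n_logical; ++ctx) {
        plan[ctx % n_physical].push_back(ctx);
    }
    return plan;
}
```

With more contexts than workers, each worker simply runs several contexts in turn. The contexts themselves do not need to know how many physical cores exist.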
A context is implemented by deriving a class from the context base class that the library provides. The parallel computation happens between BSP_BEGIN() and BSP_END(): code in between will be executed in parallel on a number of logical processors.
BSP-style communication is achieved using a subset of the BSP communication prototypes that were already part of BSPonMPI. Code between BSP_BEGIN() and BSP_END() is a bit special: you will need to use BSP_SYNC() rather than bsp_sync(), and there are a few more restrictions.
Logical processors are mapped to physical cores via TBB multithreading and MPI. The number of logical processors is given to the runner later on in the main function:
The separation between communication and computation gets us out of trouble w.r.t. a lot of the issues normally encountered with multithreaded programming (variable sharing, etc.). Unfortunately, not all of them, since all our processors still share the same console on one node. Here is how to say hello without garbling the output:
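One common way to keep console output from interleaving is sketched below in standard C++, independent of BSPonMPI. Each message is first assembled in a private buffer and then written in a single call while holding a shared lock. The names console_mutex and say_hello are hypothetical and exist only for this illustration.

```cpp
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>

// One lock shared by every thread that writes to the same stream.
std::mutex& console_mutex() {
    static std::mutex m;
    return m;
}

// Hypothetical helper: writes one complete greeting line to `out`.
void say_hello(std::ostream& out, int pid, int nprocs) {
    // Build the whole line privately first ...
    std::ostringstream line;
    line << "Hello from processor " << pid << " of " << nprocs << "\n";

    // ... then emit it in one piece while holding the lock.
    std::lock_guard<std::mutex> lock(console_mutex());
    out << line.str();
}
```

Calling say_hello(std::cout, pid, nprocs) from each logical processor's thread yields whole lines in some arbitrary order, never fragments of two lines mixed together. This only covers threads within one process. Output from separate MPI processes is merged by the launcher and is outside the lock's reach.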
One easy way to exchange data between logical processors is global memory. In a global memory block, data is mapped to processors in an arbitrary fashion, and processors can read/write in BSP style (i.e. request read/write operations which will be carried out in the communication phase of the superstep).
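The request-now, execute-at-sync behaviour can be sketched with a small stand-in class. This is not the BSPonMPI API. GlobalArray, request_put, request_get, sync and hello_exchange are hypothetical names, the data lives in one ordinary vector rather than being distributed over processors, and serving reads before writes at the sync is a choice made here for definiteness.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a global memory block (not the BSPonMPI API).
class GlobalArray {
    struct Put { std::size_t index; int value; };
    struct Get { std::size_t index; int* target; };

    std::vector<int> data_;
    std::vector<Put> puts_;
    std::vector<Get> gets_;

public:
    explicit GlobalArray(std::size_t n) : data_(n, 0) {}

    // Request a write; it takes effect at the next sync().
    void request_put(std::size_t index, int value) { puts_.push_back({index, value}); }

    // Request a read into *target; it is filled in at the next sync().
    void request_get(std::size_t index, int* target) { gets_.push_back({index, target}); }

    // Communication phase: serve all queued reads, then apply all queued writes.
    void sync() {
        for (const Get& g : gets_) { *g.target = data_[g.index]; }
        for (const Put& p : puts_) { data_[p.index] = p.value; }
        gets_.clear();
        puts_.clear();
    }
};

// Each of p simulated processors writes its ID to its own slot,
// then reads the slot of its right-hand neighbour.
std::vector<int> hello_exchange(std::size_t p) {
    GlobalArray mem(p);
    std::vector<int> received(p, -1);
    for (std::size_t pid = 0; pid < p; ++pid) {
        mem.request_put(pid, static_cast<int>(pid));
    }
    mem.sync();  // end of superstep 1: writes become visible
    for (std::size_t pid = 0; pid < p; ++pid) {
        mem.request_get((pid + 1) % p, &received[pid]);
    }
    mem.sync();  // end of superstep 2: reads are filled in
    return received;
}
```

Because nothing changes until sync is called, a processor can never observe a half-updated array during its computation phase.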
Global memory must be allocated at node level, outside the contexts. In the run() function in MyContext, inside the BSP_BEGIN/END block, we can now access this memory (the handle h is declared as a global variable; note that such variables need to be accessed in a thread-safe fashion):
So, why use BSPonMPI?
A legitimate first question is to ask why one should not just use MPI + OpenMP/TBB/std::thread.
My answer is: sure, if you want to write the fastest possible code for a specific HPC system and have enough time and budget for programming, use MPI plus whichever threading library works best on that system.
However, if you need code that will be easy to maintain and port between systems, structuring computation/communication in a BSP-like fashion will be useful: computation and communication routines can then be maintained and adapted separately. This is where BSPonMPI comes in: it wraps up MPI and TBB in one package with a simple programming interface. That’s what I use it for.