## Parallel Programming in C++

Theoretical and practical scalability

# Introduction

BSP is a (theoretical) model for parallel computation (L.G. Valiant. A bridging model for parallel computation. CACM 33:8, Aug 1990, doi: 10.1145/79173.79181). It gives a way to make formal arguments about the computation and communication costs, as well as the scalability of parallel algorithms.

It is also a practical programming model. Many parallel programs, regardless of their programming language or parallel programming library, are essentially in BSP style.

Here is a recent list of Q&A’s and links

I wrote this article to document my approach for writing parallel code in BSPonMPI v0.4. The idea is to write a single piece of C++ code which can scale from running on your desktop/laptop up to running on an HPC cluster.

BSPonMPI 0.3 includes a few examples, however, the new version in my Github project adds these new features:

• Hybrid BSP programming using MPI and TBB.
• A broadcasting and reducing framework (half-way to MapReduce – allows us to easily initialise distributed variables, and combine results)
• SPMD command line processing

I will write a few more detailed pieces of documentation for each of the above soon, this post shows a bit of a preview.

# Bulk-synchronous parallelism (BSP)

BSP gives us a framework for answering two essential questions about a parallel algorithm:

1. What is the asymptotic speedup when solving a problem of size n on p processors?
2. What is the asymptotic overhead in communication for distributing the input/output of this computation?

When thinking about a BSP computation, we alternate between computation and communication phases. During the computation phase, a number of processors work independently, and (possibly) in parallel. In the communication phase, they exchange their results as necessary to either get their inputs for the next computation phase, or to store their output.

## Specialised BSP Libraries

There are various specialised libraries for BSP-style programming. Among others, there are:

Here are some older patches to make PUB and the Oxford BSP Toolset work on more recent systems:

# Writing BSP programs in C++

Most of the libraries shown above implement the BSPlib standard which was created in the 1990’s as a simple, but generic programming interface for parallel programming. However, MPI succeeded as the most popular method for parallel programming on clusters. For fine-grained/SMP-style parallel programming, OpenMP and the threading building blocks (TBB) are among the most succesful high-level tools to use.

BSPonMPI becomes useful if you have some fast sequential code and are looking for a quick way to have a parallel version that runs both on desktop/laptops (for small problem instances) and scales on larger cluster systems.

# Hello World

Our first program will not do much: each processor will say hello and write a value (its ID) to a global memory array. Then, each processor will read a value from this array and output it:

The whole file can be found here, the following is a short walkthrough of what it does.

## Contexts and Runners

Two essential concepts introduced in v0.4 are contexts and runners.

Contexts

A context includes everything a processor will need to run part of a parallel computation. In the BSP model, each context object stores the data for exactly one BSP processor’s local memory

Runners

A runner performs the task of assigning contexts to physical processors. It’s nice to un-couple logical and physical processors: it enables us to start thinking about a lot of things – like overpartitioning (making subproblems smaller), fault-tolerance (we could make a copy of a context and map it many times), etc.

A context is implemented by deriving a class from bsp::Context:

The parallel computation happens between BSP_BEGIN() and BSP_END(). Everything in between there will be executed in parallel on a number of logical processors. BSP-style communication is achieved using a subset of the BSP communication prototypes which were part of BSPonMPI before, too.

Code between BSP_BEGIN() and BSP_END() a bit special. You will need to use BSP_SYNC() rather than bsp_sync(), and there are a few more restrictions.

Logical processors are mapped to physical cores via TBB multithreading and MPI. The number of logical processors is given to the runner later on in the main function:

The separation between communication and computation gets us out of trouble w.r.t. a lot of the issues normally encountered with multithreaded programming (variable sharing, etc.). Unfortunately, not all of them, since all our processors still share the same console on one node. Here is how to say hello without garbling the output:

## Global Memory

One easy way to exchange data between logical processors is global memory. In a global memory block, data is mapped to processors in an arbitrary fashion, and processors can read/write in BSP style (i.e. request read/write operations which will be carried out in the communication phase of the superstep).

Global memory must be allocated on node-level (outside BSP_BEGIN/BSP_END):

Within the run() function in MyContext, inside the BSP_BEGIN/END block, we can now access this memory (the handle h is declared as a global variable – note that these variables need to be accessed in a thread-safe fashion):

# Conclusion

So, why use BSPonMPI?

A legitimate first question is to ask why one should not just use MPI + OpenMP/TBB/std::thread.

My answer is: Sure, if you want to write the fastest-possible code for a specific HPC system, and have got enough time/budget for programming: Use MPI + whichever Threading Library is best on there.

However, if you need code that will be easy to maintain and port between systems, structuring computation/communication in a BSP-like fashion will be useful: computation and communication routines can then be maintained and adapted separately. This is where BSPonMPI comes in, it wraps up MPI and TBB in one package with a simple programming interface. That’s what I use it for.

Category: research Tags: parallel programming c++ bsp bsponmpi