mass communicating

Parallel Programming in C++

Theoretical and practical scalability

Introduction

About BSP

BSP is a (theoretical) model for parallel computation (L.G. Valiant. A bridging model for parallel computation. CACM 33:8, Aug 1990, doi: 10.1145/79173.79181). It gives a way to make formal arguments about the computation and communication costs, as well as the scalability of parallel algorithms.

It is also a practical programming model. Many parallel programs, regardless of their programming language or parallel programming library, are essentially in BSP style.

Here is a recent list of Q&A’s and links

There is also a Wikipedia page

I wrote this article to document my approach for writing parallel code in BSPonMPI v0.4. The idea is to write a single piece of C++ code which can scale from running on your desktop/laptop up to running on an HPC cluster.

BSPonMPI 0.3 includes a few examples, however, the new version in my Github project adds these new features:

I will write a few more detailed pieces of documentation for each of the above soon, this post shows a bit of a preview.

Bulk-synchronous parallelism (BSP)

BSP gives us a framework for answering two essential questions about a parallel algorithm:

  1. What is the asymptotic speedup when solving a problem of size n on p processors?
  2. What is the asymptotic overhead in communication for distributing the input/output of this computation?

Many people have made slides about BSP. Here are mine.

When thinking about a BSP computation, we alternate between computation and communication phases. During the computation phase, a number of processors work independently, and (possibly) in parallel. In the communication phase, they exchange their results as necessary to either get their inputs for the next computation phase, or to store their output.

Specialised BSP Libraries

There are various specialised libraries for BSP-style programming. Among others, there are:

Here are some older patches to make PUB and the Oxford BSP Toolset work on more recent systems:

Writing BSP programs in C++

Most of the libraries shown above implement the BSPlib standard which was created in the 1990’s as a simple, but generic programming interface for parallel programming. However, MPI succeeded as the most popular method for parallel programming on clusters. For fine-grained/SMP-style parallel programming, OpenMP and the threading building blocks (TBB) are among the most succesful high-level tools to use.

BSPonMPI becomes useful if you have some fast sequential code and are looking for a quick way to have a parallel version that runs both on desktop/laptops (for small problem instances) and scales on larger cluster systems.

Hello World

Our first program will not do much: each processor will say hello and write a value (its ID) to a global memory array. Then, each processor will read a value from this array and output it:

Hi, I am processor 1 of 4
Hi, I am processor 4 of 4
Hi, I am processor 2 of 4
Hi, I am processor 3 of 4
Hi, I am processor 1 of 4, I have read value 3 from the global memory 
Hi, I am processor 4 of 4, I have read value 0 from the global memory 
Hi, I am processor 2 of 4, I have read value 2 from the global memory 
Hi, I am processor 3 of 4, I have read value 1 from the global memory 

The whole file can be found here, the following is a short walkthrough of what it does.

Contexts and Runners

Two essential concepts introduced in v0.4 are contexts and runners.

Contexts

A context includes everything a processor will need to run part of a parallel computation. In the BSP model, each context object stores the data for exactly one BSP processor’s local memory

Runners

A runner performs the task of assigning contexts to physical processors. It’s nice to un-couple logical and physical processors: it enables us to start thinking about a lot of things – like overpartitioning (making subproblems smaller), fault-tolerance (we could make a copy of a context and map it many times), etc.

A context is implemented by deriving a class from bsp::Context:

class MyContext : public bsp::Context {
public:
	MyContext() : counter (0) { }

	void init() {
		// ...
	}

	/** This is the overloaded run method. */
	void run() {
		/** the first line of executed code MUST BE like this. */
		BSP_SCOPE(MyContext);

		BSP_BEGIN();
		// everything in here runs in parallel and should only use 
		// member variables
		BSP_END();
	}


protected:
	/** Members must be public or protected, since BSP_BEGIN et al. 
	 *  rely on them being accessible to subclasses */
	int counter;

};

The parallel computation happens between BSP_BEGIN() and BSP_END(). Everything in between there will be executed in parallel on a number of logical processors. BSP-style communication is achieved using a subset of the BSP communication prototypes which were part of BSPonMPI before, too.

Code between BSP_BEGIN() and BSP_END() a bit special. You will need to use BSP_SYNC() rather than bsp_sync(), and there are a few more restrictions.

Logical processors are mapped to physical cores via TBB multithreading and MPI. The number of logical processors is given to the runner later on in the main function:

int main(int argc, char ** argv) {
	bsp_init (&argc, &argv);

	// Things from here on are node-level SPMD. 
	// You'll have as many processes in parallel as there are
	// available via MPI.

	int processors = 4; // or any other number > 1
	bsp::Runner<MyContext> r (processors);
	r.run();
	// ...
	bsp_end();
}

Multithreaded Hello

The separation between communication and computation gets us out of trouble w.r.t. a lot of the issues normally encountered with multithreaded programming (variable sharing, etc.). Unfortunately, not all of them, since all our processors still share the same console on one node. Here is how to say hello without garbling the output:

#include <tbb/spin_mutex.h>

/** this is a helper to make sure output doesn't get garbled */
tbb::spin_mutex output_mutex;

class MyContext : public bsp::Context { 
// ... 
protected:

	/**
	 * We can make functions which are called from within the 
	 * supersteps, but they need to be class member to get access
	 * to the correct bsp_... functions.
	 */
	void print_info () {
		using namespace std;
		tbb::spin_mutex::scoped_lock l (output_mutex);
		cout << "Hi, I am processor " << bsp_pid()+1 << " of " << bsp_nprocs() << endl;
	}
// ...

Global Memory

One easy way to exchange data between logical processors is global memory. In a global memory block, data is mapped to processors in an arbitrary fashion, and processors can read/write in BSP style (i.e. request read/write operations which will be carried out in the communication phase of the superstep).

Global memory must be allocated on node-level (outside BSP_BEGIN/BSP_END):

	/** we can still do stuff on node level here. like allocate 
	 *  global memory. Note that bsp_pushreg doesn't fall into 
	 *  this category, this, we want to do after BSP_BEGIN (if
	 *  the memory should be local to tasks rather than nodes, 
	 *  that is).
	 */
	h = bsp_global_alloc(processors * sizeof(int));
	::bsp_sync();

	bsp::Runner<MyContext> r (processors);
	r.run();

	bsp_global_free(h);

Within the run() function in MyContext, inside the BSP_BEGIN/END block, we can now access this memory (the handle h is declared as a global variable – note that these variables need to be accessed in a thread-safe fashion):

BSP_BEGIN();

/** 
 * Things from here on are task-level SPMD. 
 * You'll have as many processes as allocated in the task mapper
 * also, we inject all members of MyContext into the current
 * scope
 */

print_info();
counter = bsp_pid();

bsp_global_put(&counter, h, sizeof(int) * (bsp_nprocs()-1-bsp_pid()), sizeof(int));

BSP_SYNC();

bsp_global_get(h, sizeof(int) * (bsp_pid()), &counter, sizeof(int));

BSP_SYNC();

{
	tbb::spin_mutex::scoped_lock l (output_mutex);
	cout << "Hi, I am processor " << bsp_pid()+1 << " of " << bsp_nprocs() << ", I have read value " <<
		counter << " from the global memory " << endl;
}

BSP_END();

Conclusion

So, why use BSPonMPI?

A legitimate first question is to ask why one should not just use MPI + OpenMP/TBB/std::thread.

My answer is: Sure, if you want to write the fastest-possible code for a specific HPC system, and have got enough time/budget for programming: Use MPI + whichever Threading Library is best on there.

However, if you need code that will be easy to maintain and port between systems, structuring computation/communication in a BSP-like fashion will be useful: computation and communication routines can then be maintained and adapted separately. This is where BSPonMPI comes in, it wraps up MPI and TBB in one package with a simple programming interface. That’s what I use it for.

Category: research Tags: parallel programming c++ bsp bsponmpi