Parallel Programming in C and Python

Amit Saha

Issue #217, May 2012

A fast-track guide to writing parallel programs in C and Python.

Advancements in computing technologies have enabled even “resource-constrained” Netbooks to have four CPU cores. Thus, parallel computing is not about having a large number of computer nodes in a cluster room anymore. To write programs that take advantage of such easily available parallel computing resources, C and Python programmers have libraries at their disposal, such as OpenMP (C) and multiprocessing (Python) for shared memory parallel programming, and OpenMPI (C/Python) for distributed memory parallel programming. And, in case you don't have hardware resources beyond your quad-core personal computer, cloud-computing solutions, such as PiCloud, can be of tremendous use.

For the purposes of this article, I assume you have one computer at your disposal and, thus, focus on OpenMP, multiprocessing and PiCloud. Note that the article's code examples demonstrate parallel programming but do not necessarily represent the optimal parallelization strategy. I have also refrained from benchmarking the speedups of the parallel programs over their serial counterparts.

Shared Memory Parallel Programming

Shared memory parallel programming libraries enable writing programs that can take advantage of the multiple CPUs on your computer.

OpenMP (Open specifications for Multi-Processing) is a standard API specification that may be used explicitly to achieve multithreaded, shared memory parallelism. The API is defined for C/C++ and FORTRAN, and it is composed of three primary components: compiler directives (#pragma in C), runtime libraries and environment variables. An OpenMP program begins execution as a single thread, the master thread, until it encounters a parallel block. A team of parallel threads is then created; this is called the fork operation. The statements in the parallel block are executed in parallel by these threads. Once the team of threads has finished execution, the master thread is back in control; this is called the join operation. This model of execution is referred to as the fork-join model.

Let's use the GNU implementation of OpenMP, called libgomp. Your distribution's package manager is the easiest way to install this library. To enable the compiler directives you use in your OpenMP programs, pass -fopenmp to the compiler, and add -lgomp to link against the runtime library when compiling your OpenMP programs.

Let's start by creating a team of threads (Listing 1). First, use the omp_get_num_procs() function to get the number of processors available, which you use to set the number of threads with the omp_set_num_threads() function. Next, specify the parallel block of the program using #pragma omp parallel. You want each thread to have a private copy of the tid variable, so append this information to the #pragma directive. In the parallel block, let's simply print a hello world message from all of the threads, including the master thread. The master thread has a tid of 0, which can be used to execute statements specific to it. The thread id of a particular thread is obtained using the omp_get_thread_num() function.
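
A minimal version of such a program might look like the following sketch (Listing 1 itself may differ in its details):

/* ompdemo.c: a minimal sketch along the lines of Listing 1 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nprocs, tid;

    /* number of processors available on this computer */
    nprocs = omp_get_num_procs();
    printf("Number of processors available:: %d\n", nprocs);

    /* create one thread per processor */
    omp_set_num_threads(nprocs);

    #pragma omp parallel private(tid)
    {   /* fork: the statements in this block run in every thread */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        if (tid == 0)   /* only the master thread executes this */
            printf("Number of threads = %d\n", omp_get_num_threads());
    }   /* join: the master thread continues alone from here */

    return 0;
}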

Once you compile and execute this program, the output, depending on the number of processors on your computer, will be similar to the following:

$ gcc -o ompdemo ompdemo.c -lgomp -fopenmp
$ ./ompdemo

Number of processors available:: 2
Hello World from thread = 0
Number of threads = 2
Hello World from thread = 1

Next, let's consider a code snippet in your program that may look like this:

A -> Large array
For every element e in array A
 call function, fun(e)
end for 

Instead of calling the function fun for every element e serially, what if you could do it in parallel? OpenMP's work-sharing constructs enable you to do just that. Consider the code snippet below:

A -> Large array
#pragma omp for schedule(dynamic,chunk)
For every element e in array A
 call function, fun(e)
end for 

The directive #pragma omp for schedule(dynamic,chunk) specifies that the ensuing for loop will be executed in parallel by the thread team, with the workload divided as described by the schedule(dynamic, chunk) clause. The variable chunk specifies the unit of division, that is, the number of iterations of the for loop handed to a thread at a time, and dynamic specifies that chunks are assigned to the threads dynamically as they finish their previous chunk.

Let's use this directive to process a large array, calling a function for each array element (Listing 2). The parallel section of the code begins at the directive #pragma omp parallel, where we specify that the arrays a and b and the variable chunk will be shared among the threads. This is required because all the threads access the arrays a and b, and chunk is used by the ensuing parallel for loop to determine the unit of division. The loop variable i will be private to each thread.
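
In sketch form, the program might look like this (Listing 2 itself may differ; the function fun and the array contents below are placeholders of my own):

/* omp_for_eval.c: a sketch along the lines of Listing 2 */
#include <stdio.h>
#include <omp.h>

#define N 100000
#define CHUNKSIZE 100

/* placeholder for the per-element function */
double fun(double x)
{
    return 2.0 * x * x + 1.0;
}

int main(void)
{
    static double a[N], b[N];
    int i, chunk = CHUNKSIZE;

    printf("Number of processors available:: %d\n", omp_get_num_procs());
    omp_set_num_threads(omp_get_num_procs());

    for (i = 0; i < N; i++)     /* fill the input array serially */
        a[i] = i * 0.5;

    #pragma omp parallel shared(a, b, chunk) private(i)
    {
        #pragma omp for schedule(dynamic, chunk)
        for (i = 0; i < N; i++)     /* iterations are handed out in chunks */
            b[i] = fun(a[i]);
    }

    printf("For evaluation completed, the result has been stored in array B\n");
    return 0;
}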

The result of the function evaluation will be available at the end of the parallel section in the array b:

$ gcc -o omp_for_eval omp_for_eval.c -lgomp -fopenmp
$ ./omp_for_eval 
Number of processors available:: 2
For evaluation completed, the result has been stored in array B

For the next example, let's calculate the value of Pi using a Monte Carlo sampling technique. The core of the algorithm involves generating random points with coordinates in the range (0,1) and keeping a count of how many of those points lie inside the first quadrant of the unit circle. The number of samples generated has a direct effect on the accuracy of the estimate of Pi. With the help of OpenMP, let's divide the samples equally among multiple threads and finally accumulate the result in the master thread to calculate the value of Pi (Listing 3).
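
A minimal version of this program might look like the following sketch (Listing 3 itself may differ in its details; here, rand_r() is used as one way of giving each thread its own random stream):

/* pi_openmp.c: a sketch in the spirit of Listing 3 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    int niter, chunk, count = 0;
    int nprocs = omp_get_num_procs();
    double pi;

    printf("Number of processors available:: %d\n", nprocs);
    omp_set_num_threads(nprocs);

    printf("Enter the number of iterations used to estimate pi "
           "(multiple of 2 please): ");
    scanf("%d", &niter);
    chunk = niter / nprocs;     /* samples generated by each thread */

    #pragma omp parallel shared(chunk) reduction(+:count)
    {
        int i;
        double x, y;
        unsigned int seed = omp_get_thread_num() + 1;   /* per-thread seed */

        for (i = 0; i < chunk; i++) {
            x = (double) rand_r(&seed) / RAND_MAX;
            y = (double) rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)   /* point lies inside the unit circle */
                count++;
        }
    }

    /* points inside the quarter circle / total points approximates pi/4 */
    pi = 4.0 * (double) count / niter;
    printf("# of iterations = %d , estimate of pi is %g \n", niter, pi);
    return 0;
}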

In this code, notice the use of the reduction clause, #pragma omp parallel shared(chunk) reduction(+:count). This effectively means that the + operation is performed on the private copies of the variable count, and the result is stored in the global shared variable count. The value of count is then used to calculate the value of Pi:

$ gcc -o pi_openmp pi_openmp.c -lgomp -fopenmp
$ ./pi_openmp 
Number of processors available:: 2
Enter the number of iterations used to estimate pi (multiple of 2 please): 10000
# of iterations = 10000 , estimate of pi is 3.1636 
$ ./pi_openmp 
Number of processors available:: 2
Enter the number of iterations used to estimate pi (multiple of 2 please): 1000000
# of iterations = 1000000 , estimate of pi is 3.14222 

Python's Multiprocessing Module

Python's multiprocessing module effectively sidesteps the Global Interpreter Lock (GIL), which prevents the threads of a Python program from truly running in parallel. With this module, your Python programs can make use of multiple processors on your computer. Listing 4 is a Python program that uses this module to get the number of CPUs on your computer and create that number of processes.
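
In sketch form, such a program might look like this (Listing 4 itself may differ in its details):

# mpdemo.py: a sketch in the spirit of Listing 4
import os
from multiprocessing import Process, current_process, cpu_count

def f(i):
    # report this process's custom name and the PID assigned to it
    print('Process::  %s Was assigned PID::  %d'
          % (current_process().name, os.getpid()))

if __name__ == '__main__':
    np = cpu_count()
    print('You have %d CPUs' % np)

    # create one process per CPU and keep track of them in a list
    procs = []
    for i in range(1, np + 1):
        p = Process(target=f, name='Process' + str(i), args=(i,))
        procs.append(p)
        p.start()

    # wait for all the spawned processes to finish
    for p in procs:
        p.join()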

A process is created by instantiating the Process class: p = Process(target=f, name='Process'+str(i), args=(i,)). The target argument specifies the callable object to be called by the process's run() method, which is invoked by the start() method; name specifies a custom name for the process; and args specifies the argument tuple for the target. Depending on the number of processors on your computer, you should see output similar to the following:

$ python mpdemo.py
You have 2 CPUs
Process::  Process1 Was assigned PID::  29269
Process::  Process2 Was assigned PID::  29270

This program maintains a list of the created process objects so that it can wait for them to finish using the join() method. If you attempt to use this example as a starting point to actually do something, you will want a way to send information back from the spawned processes to the master process. There are a few ways to achieve this. The easiest, but not really recommended, way is to use shared state variables. Another way to establish a communication channel between the master and the spawned processes is to use Queues. Listing 5 uses the multiprocessing module to calculate the dot product of a large array by splitting the calculation across multiple processes.
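
In sketch form, the program might look like this (Listing 5 itself may differ, and the arrays below are simple placeholders of my own, so the printed value won't match the sample output further down):

# mp_queue.py: a sketch along the lines of Listing 5
from multiprocessing import Process, Queue, cpu_count

def dot_chunk(a_q, b_q, product):
    # pull this process's chunk of each array and push back the partial product
    a_part = a_q.get()
    b_part = b_q.get()
    product.put(sum(x * y for x, y in zip(a_part, b_part)))

if __name__ == '__main__':
    np = cpu_count()
    print('You have %d CPUs' % np)

    n = 100000                                  # size of the arrays
    a = [1.0] * n
    b = [i / float(n) for i in range(n)]
    chunk = n // np                             # elements per process

    product = Queue()                           # holds the partial products
    procs = []
    for i in range(np):
        a_q, b_q = Queue(), Queue()             # one pair of queues per process
        lo = i * chunk
        hi = n if i == np - 1 else lo + chunk
        a_q.put(a[lo:hi])
        b_q.put(b[lo:hi])
        p = Process(target=dot_chunk, args=(a_q, b_q, product))
        procs.append(p)
        p.start()

    # collect one partial product per process, then wait for the processes
    dot = sum(product.get() for _ in range(np))
    for p in procs:
        p.join()

    print('Dot product::  %s' % dot)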

For every process created, a pair of queue objects is created, one each for the two arrays. The whole array is divided into equal chunks, and each process is assigned a chunk for which to calculate the partial product. The partial products are stored in the queue object product. After the processes have finished execution, the partial products are retrieved from the queue and summed to get the actual dot product. Here is sample output:

$ python mp_queue.py
You have 2 CPUs
Dot product::  3333350.00017

Next, let's take a look at Process pools. As the name implies, a pool of processes is created using the Pool class, and each member of this process pool is then assigned a task to execute using methods such as apply() or map(). Listing 6 demonstrates using Process pools to calculate the value of Pi using the same algorithm as shown in Listing 3.
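
A sketch of this approach follows (Listing 6 itself may differ; the total sample count here is my own choice):

# pi_mp.py: a sketch in the spirit of Listing 6
import random
from multiprocessing import Pool, cpu_count

def monte_carlo_pi_part(n):
    # count how many of n random points fall inside the unit circle
    count = 0
    for _ in range(int(n)):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            count += 1
    return count

if __name__ == '__main__':
    np = cpu_count()
    print('You have %d CPUs' % np)

    n = 10000000                        # total number of samples
    part_count = [n // np] * np         # samples to generate in each process

    pool = Pool(processes=np)
    count = pool.map(monte_carlo_pi_part, part_count)

    print('Estimated value of Pi::  %s' % (4.0 * sum(count) / n))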

The pool of processes is created using the statement pool = Pool(processes=np), and each process in this pool is asked to invoke the function monte_carlo_pi_part using count = pool.map(monte_carlo_pi_part, part_count), where part_count is a list with the number of samples to be generated in each process. The returned value from each process is stored in the list count. Finally, let's sum this list and calculate the estimated value of Pi:

$ python pi_mp.py
You have 2 CPUs
Estimated value of Pi::  3.1412128

The call to map() is a blocking call; that is, it does not return until the results from all the processes have been received. For other applications, the asynchronous version, map_async(), may be more desirable.
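
A minimal fragment of the asynchronous variant, reusing pool, monte_carlo_pi_part and part_count from the sketch above:

# map_async() hands back an AsyncResult object right away
async_count = pool.map_async(monte_carlo_pi_part, part_count)
# ... the master process is free to do other work here ...
count = async_count.get()   # blocks only now, until all results are in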

The multiprocessing module has other features, such as the ability to execute remote jobs by the use of managers, proxy objects and a number of synchronization primitives. See the module documentation for more information.

Parallel Computing in the Cloud with PiCloud

If you have run out of cores on your workstation and want to make more computing power accessible to your program easily, it's worth looking into PiCloud.

To start using PiCloud, create an account on the project Web site, and then install the client from the PyPI package index. The Quickstart documentation should help you set up PiCloud on your computer (see Resources).

Let's write a simple function and run it on PiCloud. First, define a function sort_num() in the file demo.py as follows:

import numpy

def sort_num(num):
    # sort the array with NumPy and return the sorted copy
    sorted_num = numpy.sort(num)
    return sorted_num

Fire up the Python interpreter and import the function you just defined and the modules cloud and numpy:

>>> from demo import *
>>> import cloud
>>> import numpy

Next, create an array of 50000 integers with each integer chosen randomly from the range (10, 10000):

>>> num=numpy.random.random_integers(10,10000,50000)

Now comes the important step, invoking the cloud.call method to specify the function that you want to run and its arguments:

>>> jid=cloud.call(sort_num,num)

This implies that you want the function sort_num to be executed with the argument num. This method returns an integer that is an identifier for this particular job. The formal specification of the cloud.call method is available at docs.picloud.com/moduledoc.html#cloud.call:

>>> jid
58

The returned value of the executed function can be obtained using the function cloud.result using the obtained jid:

>>> cloud.result(jid)
array([   10,    11,    11, ..., 10000, 10000, 10000])

As you can see, the sorted array is returned. Next, consider a function (defined in the file mapdemo.py) that returns the volume of a cylinder when the radius (r) and height (h) are passed to it:

import numpy

def vol_cylinder(r,h):
    return numpy.pi*r*r*h

If you had just one cylinder configuration, that is, one (r,h) pair, the cloud.call method could be used for the purpose. What if you had 100 different configurations to evaluate? You could use 100 cloud.call invocations, but there is a more efficient way of doing it: using the cloud.map function, which accepts a sequence of parameters (or sequences of parameters):

>>> from mapdemo import * 
>>> import cloud 
>>> import numpy 
>>> r=numpy.random.random_sample(100) 
>>> h=numpy.random.random_sample(100) 
>>> jids=cloud.map(vol_cylinder,r,h) 
>>> jids
xrange(359, 459)

As you can see, jids is a sequence of 100 job IDs, 359 through 458 in this case (note that the values you get may be different). Here, cloud.map has created 100 jobs, with the arguments formed from the 100 pairs of the two sequences r and h. You can pass this sequence of jids directly to cloud.result to get the results as a list:

>>> result = cloud.result(jids)
>>> result 
[0.35045338986267927, 0.0043144690799585004, 0.094018119621969765,
0.93579722612039329, 0.0045154147876736109, 0.018836324478609345,
0.0027243595262778321, 1.5049675511377265, 0.37383416636274164,
0.24435487403102638, 0.28248315493701553, 1.2879340600324913,
0.68406526971023041, 0.14338739850272786,...

In more practical situations, the function computation may not be so trivial, and you may not know in advance when all the jobs will finish. In such a case, you can wait until all the jobs have finished before retrieving the results, using the function cloud.join. The PiCloud documentation includes an example that calculates the value of Pi (yes, again!) using the map() method (see Resources).
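
For instance, with the jids returned by the earlier cloud.map call, the pattern would be roughly the following (a sketch; see the PiCloud documentation for the exact signatures):

>>> cloud.join(jids)             # block until all the jobs have finished
>>> result = cloud.result(jids)  # now safe to collect every result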

Among a host of other interesting features, PiCloud allows your Python application to expose functions via a REST API so that they can be called from any other programming language; see the Resources section of this article for an example. PiCloud also lets you set up cron jobs, choose among different computing resources and use environments to install any native libraries your application needs.

Distributed Memory Parallel Programming

The past few sections of this article give a taste of parallel programming on a device as resource-constrained (relatively!) as a dual-core desktop. What if you already have a 30-node computing cluster at your disposal? In that case, you would use the tried-and-tested OpenMPI, an implementation of the MPI-2 specification. Another solution for setting up a computing cluster is the Parallel Virtual Machine (PVM). If you don't have multiple computers and still want to try out OpenMPI programming, you can simply run it on a standalone computer or use a virtualization solution, such as VirtualBox, to create a cluster of virtual machines.
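
Just to give a flavor of what an OpenMPI program looks like (it is not one of this article's listings), here is a minimal C sketch; compile it with mpicc and launch it with mpirun -np <number of processes>:

/* mpi_hello.c: a minimal MPI sketch */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* set up the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}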

Conclusion

My intention with this article was to provide an introduction to parallel programming. Because the hardware requirements are really basic, I mostly looked at shared memory parallel programming. I also covered a more modern way of parallel computing using a cloud-computing service. Another very interesting project, Parallel Python, enables writing shared memory as well as distributed memory parallel programs in Python. I hope the examples in this article help you get started exploring parallel programming from the comfort of your Netbook!

Amit Saha is currently a PhD research student in the area of Evolutionary Algorithms and Optimization. Like his random echoes show (echorand.me), he has been writing on myriad Linux and open-source technologies for the past five years. He welcomes comments on this article and beyond at amitsaha.in@gmail.com.