
Parallel programming in Fortran using OpenMP

Basic OpenMP usage illustrated by computing the Julia fractal set

In order to use OpenMP features in Fortran with the GNU Fortran compiler, one should pass the -fopenmp flag when compiling and linking. Once the feature is enabled, the omp_lib module becomes available.

program omp_test
  use omp_lib
  implicit none
  ! omp_get_max_threads() reports how many threads a parallel region may use
  print "(A, I2, A)", "There are ", omp_get_max_threads(), " threads available in this process."
  print "(A, I2, A)", "I'm thread number ", omp_get_thread_num(), "."
end program omp_test
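
Assuming the file above is saved as omp_test.f90 (the file name is just a choice for this example), it can be built and run with something like:

gfortran -fopenmp omp_test.f90 -o omp_test.exe
./omp_test.exe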

Once everything is set up correctly, we can start experimenting with OpenMP directives. These are macro-like comment statements: you can add them to your code without changing its behavior unless the -fopenmp option is used.

For example, let's write a simple loop that prints numbers from one to ten:

!$omp parallel do
do i = 1, 10
  print *, "The number is ", i
end do
!$omp end parallel do

We can observe that when -fopenmp is active, the numbers do not always show up in order. Welcome to parallel computing! This is something to remember: when you go parallel, never assume any execution order.
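
To make the scrambling visible, here is a minimal sketch that also prints which thread handled each iteration (the program name and messages are only illustrative):

program ordering_demo
  use omp_lib
  implicit none
  integer :: i
  !$omp parallel do
  do i = 1, 10
    print *, "Iteration ", i, " was handled by thread ", omp_get_thread_num()
  end do
  !$omp end parallel do
end program ordering_demo

Running it a few times shows the print order changing from run to run, even though each thread works on its own block of iterations.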

All of these directives start with the !$omp sentinel. You can also use the !$ sentinel alone to compile a line only when -fopenmp is active. This is useful for our first example:

  !$ print "(A, I2, A)", "There are ", omp_get_num_threads(), " threads available in this process."
  !$ print "(A, I2 ,A)", "I'm thread number ", omp_get_thread_num(), "."

Now these two lines only compile when OpenMP is enabled, thus avoiding compilation errors in our sequential program.

Application to a fractal computation

With this simple parallel do construct, we can already build a parallel program. Let's take as an example a program computing a fractal.
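
The sequential julia.f90 is not listed in full here, so below is a minimal sketch of what it might look like. The array img, the index convention img(j,i) and the convergence routine match the snippets shown later; the Julia constant c, the iteration cap kmax and the command-line handling are assumptions made for illustration.

program julia
  implicit none
  integer :: N, i, j, kmax
  integer, allocatable :: img(:, :)
  real :: z0r, z0i
  complex :: z0, c
  character(len=32) :: arg

  ! Image resolution, passed on the command line (4096 in this article)
  call get_command_argument(1, arg)
  read (arg, *) N

  kmax = 25                ! iteration cap, i.e. the grey-scale depth
  c = cmplx(-0.8, 0.156)   ! Julia constant -- an assumed value
  allocate (img(N, N))

  do i = 1, N
    do j = 1, N
      ! Map pixel (j, i) onto the square [-2, 2] x [-2, 2] of the complex plane
      z0i = -2.0 + i / real(N) * 4.0
      z0r = -2.0 + j / real(N) * 4.0
      z0 = cmplx(z0r, z0i)
      img(j, i) = convergence(z0, c, kmax)
    end do
  end do

  ! ... write img to an image file ...

contains

  ! Number of iterations of z -> z**2 + c before |z| escapes, capped at kmax
  integer function convergence(z0, c, kmax)
    complex, intent(in) :: z0, c
    integer, intent(in) :: kmax
    complex :: z
    z = z0
    convergence = 0
    do while (abs(z) < 2.0 .and. convergence < kmax)
      z = z * z + c
      convergence = convergence + 1
    end do
  end function convergence

end program julia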

We compile and run the basic program like so

gfortran julia.f90 -o julia.exe
hyperfine './julia.exe 4096'

where 4096 is the resolution of the fractal image (4096x4096) and hyperfine is a benchmarking tool.

We get an average wall time of 435 ms for the sequential version on an Apple M1 Pro. One smart way to achieve a speedup would be to compute all the pixels in parallel. Unfortunately, we do not have anywhere near that many cores on our CPU, but OpenMP will happily handle the scheduling of the 4096 outer-loop iterations onto my 8 CPU threads if we simply tell it:

!$omp parallel do
do i = 1, N
  do j = 1, N
    ! Julia fractal thingy
    img(j,i) = ...
  end do
end do
!$omp end parallel do

Mind your loop order: breaking contiguous memory access in a multithreaded program is even more costly. Here \(j\) is the fast index and \(i\) the slow one (Fortran is column-major).
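
For contrast, here is a sketch with the loop nest inverted: consecutive iterations of the inner loop would then write to elements of img that are \(N\) entries apart in memory, ruining locality.

!$omp parallel do
do j = 1, N
  do i = 1, N
    ! consecutive values of i touch img elements N apart: poor locality
    img(j,i) = ...
  end do
end do
!$omp end parallel do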

And so we get 855.8 ms on average. Damn, we just killed performance when we were supposed to improve it! Several things actually happened: yes, we parallelized our outer loop, but the fractal now looks like this:

Not exactly Julia

It looks like a heavily corrupted, recursive version of the fractal. Take a moment to think about the possible causes.

The root of the problem lies in how variables are shared between threads. In fact, our error was to forget about what happens inside the loops:

! Inside of the nested do loops
z0i = -2.0 + i / real(N) * 4.0
z0r = -2.0 + j / real(N) * 4.0
z0 = CMPLX(z0r, z0i)
img(j,i) = convergence(z0, c, kmax)

The ownership of the variables z0, z0r, z0i and img has to be specified, or all the threads will write to them at the same time! In our case, we need to make z0, z0r and z0i private to each thread; img will be shared by default.

!$omp parallel do private(z0, z0r, z0i) shared(img)
do i = 1, N
  do j = 1, N
    ! Julia fractal thingy
    img(j,i) = ...
  end do
end do
!$omp end parallel do

Now we get the correct fractal in 121 ms:

Seems better
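
A good habit for catching this kind of mistake at compile time (not used in the runs above, just a suggestion) is the default(none) clause, which forces every variable referenced in the region to appear in an explicit private or shared list:

!$omp parallel do default(none) private(z0, z0r, z0i) shared(img, N, c, kmax)

The loop indices i and j do not need to be listed, as loop variables are private by default.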

To go even faster, we will introduce another clause for this omp do construct: schedule. It lets us tell OpenMP how the loop iterations are distributed among threads: with schedule(STATIC, X), the iterations are dealt out to threads in chunks of X consecutive iterations. A sensible first guess for X is the number of iterations divided by the number of threads, hence for \(N = 4096\) and my 8 threads, \(X = 512\).

!$omp parallel do private(z0, z0r, z0i) shared(img) schedule(STATIC, 512)

You will however notice that this value is far from the optimum. To understand why, we have to consider how long it takes to compute one pixel. If all pixels took the same wall time, 512 would indeed be optimal, but in reality the fractal's pixels have computation times ranging from 1x to 25x for a depth of 25 grey values. The number of iterations needed for a pixel is not predictable, so we must steer away from a static schedule where each thread gets the same number of pixels. The alternatives are the DYNAMIC and GUIDED schedules.
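
With schedule(DYNAMIC, X), threads grab a new chunk of X iterations as soon as they finish their current one, so rows full of cheap pixels do not leave threads idle. The chunk size of 64 below is only an illustrative guess, not a tuned value:

!$omp parallel do private(z0, z0r, z0i) shared(img) schedule(DYNAMIC, 64)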

Most of the information present in this article is extracted from this source.
