
Solving the Poisson equation using Nvidia CUDA

In this issue, we will walk through the implementation of a simple CUDA kernel for solving the Poisson equation.

The algorithm

This problem is particularly well suited to GPU processing because of its innate parallelism: even an asynchronous, in-place field update leaves the physics intact.

In Fortran, the update code reads:

  ! Update the interior points (arrays assumed declared 1:nx, 1:ny)
  do i = 2, nx-1
    do j = 2, ny-1
      V(i,j) = EPS(i+1,j)/(dx**2)*V_old(i+1,j) &
             + EPS(i-1,j)/(dx**2)*V_old(i-1,j) &
             + EPS(i,j+1)/(dy**2)*V_old(i,j+1) &
             + EPS(i,j-1)/(dy**2)*V_old(i,j-1)
      V(i,j) = V(i,j) / &
               ( (EPS(i+1,j) + EPS(i-1,j)) / (dx**2) &
               + (EPS(i,j+1) + EPS(i,j-1)) / (dy**2) )
    end do
  end do
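On the GPU side, each grid point gets its own thread. A sketch of the corresponding kernel is shown below, matching the `jacobi_iterator_GPU` prototype used later in this article; the in-place update, the square `n`-by-`n` row-major grid, and the assumption `dx == dy` (which makes the `1/dx**2` factors cancel between numerator and denominator) are assumptions of this sketch, not part of the original code.

```cuda
// One thread per interior point of an n-by-n grid, flattened row-major
// (index (i, j) -> i * n + j). The update is done in place: as noted
// above, an asynchronous sweep still leaves the physics intact.
// Assuming dx == dy, the 1/dx**2 factors cancel between numerator and
// denominator, which is why no spacing argument appears.
__global__ void jacobi_iterator_GPU(
    double * __restrict__ V,
    const double * __restrict__ EPS,
    const int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  const int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < 1 || i >= n - 1 || j < 1 || j >= n - 1)
    return;  // keep the boundary values fixed

  const double num =
      EPS[(i + 1) * n + j] * V[(i + 1) * n + j]
    + EPS[(i - 1) * n + j] * V[(i - 1) * n + j]
    + EPS[i * n + (j + 1)] * V[i * n + (j + 1)]
    + EPS[i * n + (j - 1)] * V[i * n + (j - 1)];
  const double den =
      EPS[(i + 1) * n + j] + EPS[(i - 1) * n + j]
    + EPS[i * n + (j + 1)] + EPS[i * n + (j - 1)];

  V[i * n + j] = num / den;
}
```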

To manage device memory, we will use the Thrust library, which provides an easy way to handle device and host arrays.

You can include the library with the following headers:

  #include <thrust/host_vector.h>
  #include <thrust/device_vector.h>

Those will also include `cuda.h`.

  // Device arrays
  thrust::device_vector<double> dev_potential (NX * NY, 0.0);
  thrust::device_vector<double> dev_epsilon   (NX * NY, 1.0);
  // Host arrays
  thrust::host_vector<double> potential (NX * NY, 0.0);
  thrust::host_vector<double> epsilon   (NX * NY, 1.0);

Here NX and NY are the dimensions of the sampling grid for the potential and the dielectric constant.
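Before launching the solver, you will typically set boundary conditions on the host and copy the data to the device. The Dirichlet values below are illustrative assumptions, not part of the original code; what matters is the pattern, in particular that Thrust's assignment operator performs the host-to-device copy.

```cuda
// Example boundary conditions (row-major flattening: index = i * NY + j).
for (int j = 0; j < NY; ++j) {
  potential[0 * NY + j]        = 1.0;  // V = 1 on the first row
  potential[(NX - 1) * NY + j] = 0.0;  // V = 0 on the last row
}

// Thrust's assignment operator performs the host-to-device copy.
dev_potential = potential;
dev_epsilon   = epsilon;
```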

Don't forget to declare your block size and your grid size. The grid should contain enough blocks to cover the whole domain, hence the round-up division:

  dim3 DimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 DimGrid((NX + BLOCK_SIZE - 1) / BLOCK_SIZE,
               (NY + BLOCK_SIZE - 1) / BLOCK_SIZE);

  gpu_main<<<1, 1>>>(
    thrust::raw_pointer_cast(&dev_potential.front()),
    thrust::raw_pointer_cast(&dev_epsilon.front()),
    NX,      // grid dimension n
    50000,   // tmax: number of Jacobi iterations
    DimGrid,
    DimBlock);

The corresponding prototypes are below: `jacobi_iterator_GPU` is a massively parallel `__global__` kernel, while `gpu_main` is a `__global__` function launched with a single thread that manages the main loop and launches `jacobi_iterator_GPU` through CUDA dynamic parallelism (compile with `-rdc=true`).

__global__ void jacobi_iterator_GPU(
    double * __restrict__ V, 
    const double * __restrict__ EPS, 
    const int n);
__global__ void gpu_main(
    double * V, 
    const double * EPS, 
    const int n, 
    const int tmax, 
    const dim3 DimGrid, 
    const dim3 DimBlock);
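Given those prototypes, a minimal sketch of `gpu_main` could look like the following. The in-place, single-buffer scheme is an assumption licensed by the asynchronous-update remark above, and dynamic parallelism requires compute capability 3.5 or newer.

```cuda
// A single device thread drives the time loop. Successive child kernels
// launched by the same thread into the same (default) stream execute in
// launch order, so the sweeps do not overlap each other. Since the
// update is done in place and tolerates asynchronous sweeps, no double
// buffering is used in this sketch.
__global__ void gpu_main(
    double * V,
    const double * EPS,
    const int n,
    const int tmax,
    const dim3 DimGrid,
    const dim3 DimBlock)
{
  for (int t = 0; t < tmax; ++t)
    jacobi_iterator_GPU<<<DimGrid, DimBlock>>>(V, EPS, n);
}
```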

After dispatching the computation to the device, the main CPU thread needs to wait for the device to finish. This is achieved with

  cudaDeviceSynchronize();
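It is good practice to also check the error code that the synchronization returns, and then copy the result back to the host. A sketch, reusing the Thrust vectors declared above:

```cuda
// cudaDeviceSynchronize returns the first error raised by any preceding
// asynchronous operation, including kernel launches.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
  std::cerr << "CUDA error: " << cudaGetErrorString(err) << "\n";

// Thrust's assignment operator performs the device-to-host copy.
potential = dev_potential;
```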

To wall-time your implementation, you can use the following pattern in C++ (it requires the `<chrono>`, `<ctime>`, and `<iostream>` headers).

  auto start = std::chrono::system_clock::now();
  // Computations
  auto end = std::chrono::system_clock::now();
  std::chrono::duration<double> elapsed_seconds = end - start;

  std::time_t end_time = std::chrono::system_clock::to_time_t(end);

  std::cout << "Finished computation at " << std::ctime(&end_time)
            << "elapsed time: " << elapsed_seconds.count() << "s\n";

Check out the for the complete Fortran version.
