
Programming Massively Parallel Processors (PMPP) Takeaways

Chapter 1. Introduction

Design philosophies for CPU/GPU design:

  • CPU: latency oriented, optimized for sequential code performance
  • GPU: throughput oriented, optimized for the total throughput of floating-point calculations and memory accesses across many parallel threads

Two types of performance bounds:

  • Memory bound: applications where execution speed is limited by memory access latency and/or throughput
  • Compute bound: applications where execution speed is limited by the number of instructions performed per byte of data (a worked example follows this list)
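A rough worked example of the ratio behind these bounds (the hardware numbers are hypothetical, chosen only for the arithmetic): vector addition performs one floating-point add per element while moving three 4-byte floats (two reads, one write), so its compute-to-memory ratio is

$$\frac{1\ \text{FLOP}}{3 \times 4\ \text{bytes}} \approx 0.083\ \text{FLOP/byte}$$

A GPU sustaining, say, 10 TFLOP/s on 1 TB/s of memory bandwidth would need about 10 FLOP/byte to keep its arithmetic units busy, so vector addition is firmly memory bound.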

Chapter 2. Heterogeneous data parallel computing

Structure of a CUDA C program: a host (CPU) and one or more devices (GPUs). Device code is marked with special CUDA C keywords. Device code includes kernels, whose code is executed in a data-parallel manner.

When a kernel function is called, a large number of threads are launched on a device. All the threads launched by a kernel are collectively called a grid. Each grid is organized as an array of thread blocks, or simply blocks. All blocks are of the same size (up to 1024 threads).

Managing device memory in CUDA C:

  • cudaMalloc(): allocates object in device global memory
    • Address of a pointer to the allocated object
    • Size of the allocated object in bytes
  • cudaFree(): frees object from device global memory
    • Pointer to the freed object
  • cudaMemcpy(): memory data transfer
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer (cudaMemcpyHostToDevice, …)

A simplified example:

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // Allocate device global memory for the input and output vectors
    cudaMalloc((void **) &A_d, size);
    cudaMalloc((void **) &B_d, size);
    cudaMalloc((void **) &C_d, size);

    // Copy the input vectors from host to device
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // kernel invocation code
    ...

    // Copy the result vector from device back to host
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);

    // Free device global memory
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}
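The example above omits error handling. Each CUDA runtime call actually returns a cudaError_t, and checking it is standard practice; below is a minimal sketch (the CHECK macro name is my own, not from the book):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical convenience macro: abort with a readable message
// if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CHECK(cudaMalloc((void **) &A_d, size));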

Built-in variables in CUDA kernel:

  • blockDim: the number of threads in a block. It is a struct with three unsigned integers (x, y, z) that organize the threads into a one-, two-, or three-dimensional array.
  • threadIdx: distinguishes each thread within a block. Also (x, y, z).
  • blockIdx: distinguishes each block within a grid. Also (x, y, z).

A unique global index for a thread in a one-dimensional layout is:

int i = threadIdx.x + blockDim.x * blockIdx.x;
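The same pattern extends axis by axis to 2D and 3D layouts. A minimal sketch, assuming a 2D grid of 2D blocks over a width x height array stored row-major (the kernel name and parameters are illustrative, not from the book):

__global__ void kernel2D(float* out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (row < height && col < width) {
        int offset = row * width + col;  // row-major offset into the flat array
        // ... per-element work on out[offset]
    }
}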

The kernel function for vecAdd:

__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

__global__ is a CUDA-C-specific keyword that marks the following function as a kernel.

CUDA C keywords for function declaration:

Keyword              Callable From      Executed On   Executed By
__host__ (default)   Host               Host          Caller host thread
__global__           Host (or Device)   Device        New grid of device threads
__device__           Device             Device        Caller device thread
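A small sketch of how the three qualifiers combine (the function names are illustrative, not from the book): a __device__ helper called from a __global__ kernel, which is launched by a __host__ function.

__device__ float square(float x) {  // runs on the device, called by a device thread
    return x * x;
}

__global__ void squareKernel(float* data, int n) {  // launched from host, runs on device
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        data[i] = square(data[i]);
    }
}

__host__ void launchSquare(float* data_d, int n) {  // ordinary host function
    squareKernel<<<ceil(n / 256.0), 256>>>(data_d, n);
}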

To call the kernel function:

vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);

The kernel call has two execution configuration parameters between <<< and >>>.

  • The first: number of blocks in the grid (an integer-arithmetic way to compute it is sketched after this list)
  • The second: number of threads in each block
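Because ceil(n/256.0) goes through floating point, the block count is often computed with integer arithmetic instead; dim3 values can also be passed once grids and blocks become multidimensional. A sketch (variable names are my own):

int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // integer ceiling of n/256
vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, n);

// Equivalent launch using dim3, which generalizes to 2D/3D configurations:
dim3 dimGrid(blocksPerGrid, 1, 1);
dim3 dimBlock(threadsPerBlock, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n);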

Compilation of CUDA C:

  • NVCC (NVIDIA CUDA Compiler)
  • Host code: compiled with the host's standard C/C++ compiler, runs as a traditional CPU process
  • Device code: marked with CUDA keywords, compiled by NVCC into PTX, which is further translated into device binary just in time for the target GPU
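For example, a single-source program can be compiled and run like this (the file and binary names are illustrative):

nvcc vecAdd.cu -o vecAdd
./vecAdd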