learn CUDA 3D

Previously https://malaysia-ai.org/learn-cuda-2d that we know CUDA threads and blocks support three dimensional space, (x, y, z), mentioned at https://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy

Quick recap, in 2D, we do,

```cuda
// 16 * 16 = 256
dim3 threadsPerBlock(16, 16);
dim3 blocksPerGrid((columns + threadsPerBlock.x - 1) / threadsPerBlock.x, 
                    (rows + threadsPerBlock.y - 1) / threadsPerBlock.y);
VecAdd<<<blocksPerGrid, threadsPerBlock>>>
```

Which is view as,

2D Grid of Thread Blocks Block (0,0) Block (1,0) Block (2,0) Block (0,1) blockIdx.y blockIdx.x Thread (0,0) Thread (1,0) Thread (0,1) Thread (1,1) Thread (0,0) Thread (1,0) Thread (0,1) Thread (1,1)

Generated by Claude Sonnet 3.5 with minimal changes.

As stated, third dimension is always there, with size of 1, we can initiate it as `dim3 threadsPerBlock(16, 16, 1)`, 16 x 16 x 1 = 64 threads, view as a cuboid.

3D Grid of Thread Blocks blockIdx.x blockIdx.y blockIdx.z (0,0,0) (1,0,0) (0,1,0)

Generated by Claude Sonnet 3.5 with minimal changes.

Because the size of third dimension is only 1, so the z coordinate is always 0, it is a cuboid, but depth-less.

x, y, z coordinates still follow the same principal, block * blockdim + thread,

- In order to know x coordinate, `blockIdx.x * blockDim.x + threadIdx.x`

- In order to know y coordinate, `blockIdx.y * blockDim.y + threadIdx.y`

- In order to know z coordinate, `blockIdx.z * blockDim.z + threadIdx.z`

Assumed we defined `rows = 100`, `columns = 100` and `z = 100`, so to convert from 3D coordinate to 1D, let say (0, 0, 1), is not that straight forward. As we know, 3D is just an array of 2D, a stack of 2D blocks on top of 2D blocks, so when we say (0, 0, 1), it is at (0, 0) coordinate at second block of 2D.

So because it is located at second block of 2D, 1 * size of rows * size of columns == 1 * 100 * 100 = 10000, plus with the coordinate of (0, 0), which is last time defined as row * columns + col == 0 * 100 + 0 == 0.

10000 + 0 = 10000, so the coordinate in 1D space is 10000.

How about (11, 3, 4)? 4 * 100 * 100 + 11 * 100 + 3 = 41103.

3D Grid of Thread Blocks blockIdx.x blockIdx.y blockIdx.z (0,0,0) (1,0,0) (0,1,0) (1,0,1) 3D to 1D index conversion: i is the current x-coordinate j is the current y-coordinate k is the current z-coordinate 1d = k * rows * columns + j * columns + i

Generated by Claude Sonnet 3.5 with minimal changes.

Now let us code 3D vector add operation,

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel definition
__global__ void VecAdd(int *a, int *b, int *c, int rows, int columns, int z) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  int k = blockIdx.z * blockDim.z + threadIdx.z;

  int idx = k * rows * columns + j * columns + i;
  c[idx] = a[idx] + b[idx];
  
}

int main()
{
    int rows = 100;
    int columns = 100;
    int z = 100;
    size_t size = rows * columns * z * sizeof(int);
    
    int *h_a, *h_b, *h_c;
    
    int *d_a, *d_b, *d_c;
    
    h_a = (int*)malloc(size);
    h_b = (int*)malloc(size);
    h_c = (int*)malloc(size);
    
    for (int i = 0; i < rows * columns * z; i++) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }
    
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);
    
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // 8 * 8 * 8 = 512
    dim3 threadsPerBlock(8, 8, 8);
    dim3 numBlocks((columns + threadsPerBlock.x - 1) / threadsPerBlock.x, 
                   (rows + threadsPerBlock.y - 1) / threadsPerBlock.y,
                   (z + threadsPerBlock.z - 1) / threadsPerBlock.z);
    VecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, rows, columns, z);

    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < rows * columns; i++) {
        if (h_c[i] != h_a[i] + h_b[i]) {
            printf("Error: %d + %d != %d\n", h_a[i], h_b[i], h_c[i]);
            break;
        }
    }
    
    free(h_a); free(h_b); free(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```

Save the complete code above above (the long one) as `test.cu`, and then you can continue to compile,

```bash
nvcc test.cu -o test
./test
```

Let us try to use debugger,

```bash
nvcc -G -g -o test test.cu
cuda-gdb test
break VecAdd
run
```

```text
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7f2b0d8d1000 (LWP 700605)]
[New Thread 0x7f2b07fff000 (LWP 700606)]
[Detaching after fork from child process 700607]
[New Thread 0x7f2b057bc000 (LWP 700614)]
[New Thread 0x7f2b04fbb000 (LWP 700615)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Thread 1 "test" hit Breakpoint 1, VecAdd<<<(13,13,13),(8,8,8)>>> (a=0x7f2adba00000, b=0x7f2adea00000, c=0x7f2adee00000, rows=100, columns=100, z=100) at test.cu:6
6         int i = blockIdx.x * blockDim.x + threadIdx.x;
```

No issue.