learn CUDA 2D

Last time we talked about the 1D vector add operation at https://malaysia-ai.org/learn-cuda. It is straightforward: we only use the `x` coordinate.

Now let's move on to a 2D add operation. CUDA actually supports up to 3D for threads and blocks, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy. Before this, for 1D we launched like this,

```cuda
int N = 1000000;
int threadsPerBlock = 1024;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>
```
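
For reference, the 1D kernel from that post looks roughly like this; this is a sketch from memory, so the exact names and signature may differ:

```cuda
// minimal 1D vector add: only the x coordinate is used
__global__ void VecAdd(int *a, int *b, int *c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i]; // guard the tail, N is not a multiple of 1024
}
```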

Which we can view as,

[Diagram: a 1D grid of thread blocks, Block (0), Block (1), Block (2), ... indexed by blockIdx.x, each containing Thread (0), Thread (1), ... Generated by Claude Sonnet 3.5.]

All threads and blocks are laid out along one dimension only, because we only passed one dimension! If you want to pass it as two dimensional instead, we just expand the Y dimension to a size of one,

```cuda
int rows = 1;          // expand the Y dimension to a size of one
int columns = 1000000; // the same one million elements, all along X
dim3 threadsPerBlock(1024, 1); // (x, y)
dim3 blocksPerGrid((columns + threadsPerBlock.x - 1) / threadsPerBlock.x, 
                   (rows + threadsPerBlock.y - 1) / threadsPerBlock.y);
VecAdd<<<blocksPerGrid, threadsPerBlock>>>
```

It will become,

[Diagram: a 2D grid of thread blocks, Block (0,0), Block (1,0), Block (2,0), ... along blockIdx.x with blockIdx.y fixed at 0, each containing Thread (0,0), Thread (1,0), ... Generated by Claude Sonnet 3.5.]

Because we initialized the Y dimension as one, its coordinate is always 0. To use multi-dimensional threads in CUDA you must use the `dim3` data type; if we pass a plain integer N, behind the scenes it is passed as `dim3(N, 1, 1)`.
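
As a quick illustration, these two launches are equivalent; the argument list here assumes the 1D kernel sketched earlier:

```cuda
int N = 1000000;
int threadsPerBlock = 1024;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock; // 977

// passing plain integers ...
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
// ... is the same as passing dim3 with y and z defaulting to 1
VecAdd<<<dim3(blocksPerGrid, 1, 1), dim3(threadsPerBlock, 1, 1)>>>(d_a, d_b, d_c, N);
```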

So, let us continue and write a simple 2D vector sum operation,

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel definition
__global__ void VecAdd(int *a, int *b, int *c, int rows, int columns) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  // guard: 1000 is not divisible by 16, so the last blocks have extra threads
  if (row < rows && col < columns) {
    int i = row * columns + col;
    c[i] = a[i] + b[i];
  }
}

int main()
{
    int rows = 1000;
    int columns = 1000;
    size_t size = rows * columns * sizeof(int);
    
    int *h_a, *h_b, *h_c;
    
    int *d_a, *d_b, *d_c;
    
    h_a = (int*)malloc(size);
    h_b = (int*)malloc(size);
    h_c = (int*)malloc(size);
    
    for (int i = 0; i < rows * columns; i++) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }
    
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);
    
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);


    // 16 * 16 = 256 threads per block
    dim3 threadsPerBlock(16, 16);
    // ceiling division: (1000 + 15) / 16 = 63 blocks in each dimension
    dim3 numBlocks((columns + threadsPerBlock.x - 1) / threadsPerBlock.x, 
                   (rows + threadsPerBlock.y - 1) / threadsPerBlock.y);
    VecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, rows, columns);

    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < rows * columns; i++) {
        if (h_c[i] != h_a[i] + h_b[i]) {
            printf("Error: %d + %d != %d\n", h_a[i], h_b[i], h_c[i]);
            break;
        }
    }
    
    free(h_a); free(h_b); free(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```
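
One thing worth adding right after the launch (not shown in the listing above, just a suggestion) is an error check, because a bad launch configuration otherwise fails silently:

```cuda
VecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, rows, columns);

// report invalid launch configurations and errors raised while the kernel ran
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();
if (err != cudaSuccess) printf("kernel error: %s\n", cudaGetErrorString(err));
```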

As you can see, we defined `rows = 1000` and `columns = 1000`, but the array is still defined as one dimensional, `size = rows * columns`.

Because the storage is still one dimensional, to convert a 1D index into a 2D coordinate we compute (index // columns, index % columns). If my index is 10040, the 2D coordinate is (10040 // 1000, 10040 % 1000) == (10, 40).

If you have a 2D coordinate and want to convert it back to 1D, 10 * columns + 40 == 10040.

So that is why we defined `int i = row * columns + col` in our kernel.
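
The same conversion as a tiny host-side sketch (the variable names are only for illustration):

```cuda
int columns = 1000;
int index = 10040;

// 1D index -> 2D coordinate, row-major layout
int row = index / columns;        // 10040 / 1000 == 10
int col = index % columns;        // 10040 % 1000 == 40

// 2D coordinate -> 1D index
int back = row * columns + col;   // 10 * 1000 + 40 == 10040
```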

[Diagram: a 2D grid of thread blocks, Block (0,0), Block (1,0), Block (2,0), Block (0,1), ... indexed by (blockIdx.x, blockIdx.y), each containing Thread (0,0), Thread (1,0), Thread (0,1), Thread (1,1), ... Generated by Claude Sonnet 3.5.]

To get the actual row position you use `blockIdx.y * blockDim.y + threadIdx.y`, and for the column position `blockIdx.x * blockDim.x + threadIdx.x`. Remember we defined the thread block size as (16, 16), a 16 by 16 square of 256 threads.

The current thread's `threadIdx.y` must be added to the current block's `blockIdx.y` multiplied by the block height `blockDim.y`, which is 16.

So if I am at thread row `threadIdx.y == 3`, and this thread sits in block row `blockIdx.y == 2`, the Y coordinate == 2 * 16 + 3 == 35.

If I am at thread column `threadIdx.x == 9`, and this thread sits in block column `blockIdx.x == 5`, the X coordinate == 5 * 16 + 9 == 89.

To get the 1D index, 35 * 1000 + 89 == 35089, and that is why our kernel is,

```cuda
__global__ void VecAdd(int *a, int *b, int *c, int rows, int columns) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  if (row < rows && col < columns) {
    int i = row * columns + col;
    c[i] = a[i] + b[i];
  }
}
```
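
Plugging the example numbers into the same formulas, as a standalone check of the arithmetic (not part of `test.cu`):

```cuda
// blockDim is (16, 16) as launched above
int row = 2 * 16 + 3;        // blockIdx.y * blockDim.y + threadIdx.y == 35
int col = 5 * 16 + 9;        // blockIdx.x * blockDim.x + threadIdx.x == 89
int i   = row * 1000 + col;  // row * columns + col == 35089
```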

Save the complete code above (the long one) as `test.cu`, then you can compile and run it,

```bash
nvcc test.cu -o test
./test
```

Let us try the debugger,

```bash
nvcc -G -g -o test test.cu   # -G adds device-side debug info, -g host-side
cuda-gdb test
# inside cuda-gdb:
break VecAdd
run
```

```text
[New Thread 0x7ffb00204000 (LWP 561497)]
[New Thread 0x7ffafeef4000 (LWP 561498)]
[Detaching after fork from child process 561499]
[New Thread 0x7ffaf7fff000 (LWP 561506)]
[New Thread 0x7ffaf77fe000 (LWP 561507)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Thread 1 "test" hit Breakpoint 1, VecAdd<<<(63,63,1),(16,16,1)>>> (a=0x7ffacba00000, b=0x7ffacea00000, c=0x7ffacee00000, columns=1000) at test.cu:6
6         int row = blockIdx.y * blockDim.y + threadIdx.y;
```

No issue.