learn CUDA N Dimension

As we know, CUDA threads and blocks only support up to a three dimensional space (x, y, z), but a tensor can go up to N dimensions! For example, Multi-head Attention is 4D, (batch size, head dimension, sequence length, dimension), and a batch of videos is 5D, (batch size, video length, height, width, depth). So how?

Let us look at a 3D tensor with shape (10, 10, 10),

[Figure: a 3D grid of blocks with shape (10, 10, 10), its axes labelled blockIdx.x, blockIdx.y, blockIdx.z, with example coordinates (0,0,0), (1,0,0), (0,1,0)]

Actually you can reshape it into 2D by combining the second and third dimensions to get (10, 100). Sure, you can say the second dimension is 100 because 10 x 10 equals 100, straightforward, but how about geometrically? What is the intuition of reshaping?

- We know the 3D tensor has 1000 elements, 10 x 10 x 10.

- We know the 2D tensor has 1000 elements, 10 x 100.

- To reshape 3D into 2D, we basically take the 2D slices and put them at the right side of the first 10 x 10 slice, so they form one wide rectangle.

[Figure: the 10x10x10 cube, sliced from back to front, reshaped into a 10x100 2D array]

Now you can see the columns become wider, 100 instead of 10.

Because we combined the third dimension into the second dimension, the slicing happened on the third dimension. As we know, 3D is just an array of 2D: if we slice at the first dimension, we get an array of height x depth slices; if we slice at the second dimension, an array of width x depth slices; and if we slice at the third dimension, an array of height x width slices.
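To convince yourself that none of this moves any data, here is a minimal host-side sketch (plain C, the numbers are just the (10, 10, 10) example above): the flat offset computed from the 3D index and the one computed from the reshaped (10, 100) index are the same formula, so a reshape only reinterprets the shape.

```cuda
#include <assert.h>
#include <stdio.h>

int main() {
    // 3D shape (10, 10, 10) reshaped into 2D shape (10, 100):
    // element (i, j, k) in 3D becomes element (i, j * 10 + k) in 2D,
    // and both point at exactly the same flat offset in memory.
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++) {
            for (int k = 0; k < 10; k++) {
                int flat_3d = i * 10 * 10 + j * 10 + k; // offset using shape (10, 10, 10)
                int flat_2d = i * 100 + (j * 10 + k);   // offset using shape (10, 100)
                assert(flat_3d == flat_2d);
            }
        }
    }
    printf("every element keeps the same offset after reshape\n");
    return 0;
}
```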

Same goes if you want to reshape 4D into 3D: you slice cubes at the 4th dimension and put them along the 3rd dimension to become a longer cuboid.

Same goes if you want to reshape 5D into 3D: you do slicing inside slicing!

To code a 5D vector add operation is simple: you just flatten the extra dimensions into one, `z = z1 * z2 * ... * zn`, where `z1` is the third dimension, `z2` is the fourth dimension, and `zn` is the (2 + n)-th dimension.
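Going the other way is just division and modulo. As a side note, if a kernel ever needs the original per-dimension indices back from the flattened index, they can be recovered like this; a minimal sketch, where the helper name and the `a`, `b`, `c` outputs are made up for illustration and are not used in the full program below.

```cuda
// Hypothetical helper: recover the three original indices (a, b, c)
// that were flattened into k = (a * z2 + b) * z3 + c, where z2 and z3
// are the sizes of the two innermost flattened dimensions.
__device__ void unflatten(int k, int z2, int z3, int *a, int *b, int *c) {
    *c = k % z3;          // index in the z3-sized dimension
    *b = (k / z3) % z2;   // index in the z2-sized dimension
    *a = k / (z3 * z2);   // index in the z1-sized dimension
}
```

Now, the complete 5D vector add: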

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel definition
__global__ void VecAdd(int *a, int *b, int *c, int rows, int columns, int z) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  int k = blockIdx.z * blockDim.z + threadIdx.z;

  // the last block in each dimension can spill past the tensor, so guard the indices
  if (i >= columns || j >= rows || k >= z) return;

  int idx = k * rows * columns + j * columns + i;
  c[idx] = a[idx] + b[idx];
}

int main()
{
    int rows = 100;
    int columns = 100;
    int z1 = 2;
    int z2 = 4;
    int z3 = 5;
    int z = z1 * z2 * z3;
    size_t size = rows * columns * z * sizeof(int);
    
    int *h_a, *h_b, *h_c;
    
    int *d_a, *d_b, *d_c;
    
    h_a = (int*)malloc(size);
    h_b = (int*)malloc(size);
    h_c = (int*)malloc(size);
    
    for (int i = 0; i < rows * columns * z; i++) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }
    
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);
    
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // 8 * 8 * 8 = 512
    dim3 threadsPerBlock(8, 8, 8);
    dim3 numBlocks((columns + threadsPerBlock.x - 1) / threadsPerBlock.x, 
                   (rows + threadsPerBlock.y - 1) / threadsPerBlock.y,
                   (z + threadsPerBlock.z - 1) / threadsPerBlock.z);
    VecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, rows, columns, z);

    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < rows * columns * z; i++) {  // verify every element
        if (h_c[i] != h_a[i] + h_b[i]) {
            printf("Error: %d + %d != %d\n", h_a[i], h_b[i], h_c[i]);
            break;
        }
    }
    
    free(h_a); free(h_b); free(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```

Save the complete code above (the long one) as `test.cu`, and then you can compile and run it,

```bash
nvcc test.cu -o test
./test
```

Let us try to use the debugger. Compile with debug symbols and start cuda-gdb,

```bash
nvcc -G -g -o test test.cu
cuda-gdb test
```

Then at the `(cuda-gdb)` prompt, set a breakpoint on the kernel and run,

```text
(cuda-gdb) break VecAdd
(cuda-gdb) run
```

```text
[New Thread 0x7ffbe6b9f000 (LWP 1306063)]
[New Thread 0x7ffbe588f000 (LWP 1306064)]
[Detaching after fork from child process 1306065]
[New Thread 0x7ffbe4d73000 (LWP 1306078)]
[New Thread 0x7ffbddfbd000 (LWP 1306079)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Thread 1 "test" hit Breakpoint 1, VecAdd<<<(13,13,5),(8,8,8)>>> (a=0x7ffbb5a00000, b=0x7ffbb5c00000, c=0x7ffbb5e00000, rows=100, columns=100, z=40) at test.cu:6
6         int i = blockIdx.x * blockDim.x + threadIdx.x;
```

No issue.
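
And the 4D Multi-head Attention shape from the beginning works exactly the same way. One possible mapping, consistent with the `idx` formula in the kernel above: keep the two innermost dimensions on x and y, and flatten everything in front of them into z. Below is a minimal host-side sketch; the function name and the shape numbers are made up for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical launch configuration for a 4D tensor of shape
// (batch, heads, seq_len, dim) with `dim` contiguous in memory:
// map dim -> x, seq_len -> y, and flatten batch * heads into z,
// exactly like z = z1 * z2 * z3 in the program above.
dim3 attentionGrid(int batch, int heads, int seq_len, int dim, dim3 threadsPerBlock) {
    int columns = dim;         // grid x covers the last dimension
    int rows = seq_len;        // grid y covers the second-to-last dimension
    int z = batch * heads;     // everything in front, flattened into grid z
    return dim3((columns + threadsPerBlock.x - 1) / threadsPerBlock.x,
                (rows + threadsPerBlock.y - 1) / threadsPerBlock.y,
                (z + threadsPerBlock.z - 1) / threadsPerBlock.z);
}

int main() {
    dim3 threadsPerBlock(8, 8, 8);
    dim3 numBlocks = attentionGrid(8, 12, 128, 64, threadsPerBlock);
    printf("grid: (%u, %u, %u)\n", numBlocks.x, numBlocks.y, numBlocks.z);
    // a kernel like VecAdd above would then be launched with
    // rows = seq_len, columns = dim, z = batch * heads
    return 0;
}
```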