This blog came from a sudden realisation of how little I knew about how matrix multiplication works on the GPU. Having done so many ML projects, I feel like I ought to understand how the most important operation in ML works: What is this "Tensor Core" thing? Why does everyone say "data movement is the bottleneck"? How fast can GPUs actually go?
To answer these questions, I decided I had to step out of my PyTorch bubble and venture into the abyss of CUDA. I wrote this blog to document all that I have learnt, so that hopefully anyone reading it won't have to go through the pain of digging through CUDA docs and code as I did.
If there is anything that I've learnt in this journey, it is that concurrent matrix multiplication is HARD. Efficient matrix multiplication heavily depends on the specific hardware you are using and the problem size you are trying to solve. There is no one-size-fits-all solution.
Enough rambling, let's dig in!
Let's remind ourselves how (NVIDIA) GPUs work. A GPU achieves parallelism by running many threads. Each thread is executed on a single CUDA core, though at a given time only a subset of the threads are active, so there can be many more threads than CUDA cores available. Each thread, whether it is active or not, has its own set of registers.
A group of 32 threads is known as a warp. All threads in a warp must execute together (or be inactive together). In most cases, there are a lot more inactive warps than active warps, and the warp scheduler is responsible for choosing which warps to execute at a given time. This allows the GPU to hide the latency of memory accesses by scheduling other warps to execute while a warp is waiting for data.
A group of warps is known as a threadblock. All warps in a threadblock are executed in the same Streaming Multiprocessor (SM). Each threadblock has its own shared memory that can be accessed by all threads in the threadblock.
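To make this hierarchy concrete, here is a minimal sketch (the kernel name and launch configuration are made up for illustration) showing how a thread works out its threadblock, warp, and lane, and where shared memory and registers come in:

```cuda
#include <cstdio>

__global__ void hierarchy_demo() {
    // Shared memory: one copy per threadblock, visible to all of its threads.
    __shared__ float tile[32];

    int warp_id = threadIdx.x / 32;   // which warp within the threadblock
    int lane_id = threadIdx.x % 32;   // position within the warp (0-31)
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;

    // Ordinary local variables like warp_id and lane_id live in per-thread registers.
    if (warp_id == 0) tile[lane_id] = (float)global_id;
    __syncthreads();

    if (threadIdx.x == 0)
        printf("block %d sees tile[0] = %.0f\n", blockIdx.x, tile[0]);
}

int main() {
    hierarchy_demo<<<4, 128>>>();  // 4 threadblocks, each with 128 threads = 4 warps
    cudaDeviceSynchronize();
    return 0;
}
```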
Note: Newer architectures
From the Volta architecture onwards, each thread also has its own program counter, call stack, etc. This means that threads within a warp can diverge and make progress independently of one another, rather than having to execute the exact same instruction in lockstep.
The Volta architecture also introduced Tensor Cores that are specialised to solve matrix multiplications of specific sizes. Each active warp has access to one Tensor Core.
In the newest Hopper architecture, there is a concept of threadblock clusters that represent a group of threadblocks. It gives the user more fine-grained control over the scheduling of threadblocks and allows the shared memory of one threadblock to be accessed by other threadblocks in the same cluster.
Suppose we want to multiply two matrices $A \in \mathbb{R}^{M \times K}$ and $B \in \mathbb{R}^{K \times N}$ to produce $C = AB \in \mathbb{R}^{M \times N}$.
Specifically, we can partition $A$, $B$ and $C$ into tiles, so that each tile of the output is a sum of smaller matrix products:

$$C_{ij} = \sum_{k} A_{ik} B_{kj}$$

where $C_{ij}$, $A_{ik}$ and $B_{kj}$ are the sub-matrices (tiles) of $C$, $A$ and $B$ respectively.
We can see that each sub-matrix $C_{ij}$ can be computed independently of the others, which is exactly what lets us spread the work across many threadblocks, warps and threads.
In practice, this partitioning is applied recursively at multiple levels, with a different partition size at each level.
Note: Padding
At any point where the problem size is not divisible by the partition size, we need to add padding. This is typically done implicitly when we load the partitioned inputs from global memory into shared memory: out-of-bounds elements are filled with zeros, so computing on the padded tile still produces the correct result.
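For illustration, here is a minimal sketch of such a guarded load (the tile size, function and variable names are made up): elements outside the matrix are replaced with zeros as the tile is copied from global to shared memory, so the rest of the kernel can pretend the tile is always full.

```cuda
constexpr int TILE = 32;  // hypothetical tile size

// Copy one TILE x TILE tile of A (M x K, row-major) into shared memory,
// padding out-of-bounds elements with zeros.
__device__ void load_tile_padded(const float* A, float* tile_smem,
                                 int M, int K,          // full problem size
                                 int row0, int col0) {  // top-left corner of this tile
    for (int i = threadIdx.y; i < TILE; i += blockDim.y) {
        for (int j = threadIdx.x; j < TILE; j += blockDim.x) {
            int row = row0 + i;
            int col = col0 + j;
            // Out-of-bounds reads become zeros: this is the implicit padding.
            tile_smem[i * TILE + j] =
                (row < M && col < K) ? A[row * K + col] : 0.0f;
        }
    }
}
```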
On a high level, three nested partitions happen to parallelise matrix multiplication on the GPU:

1. On the threadblock level, each threadblock computes one tile of the output.
2. On the warp level, each warp within a threadblock computes a smaller tile of its threadblock's output.
3. On the instruction level, each Tensor Core instruction issued by a warp computes an even smaller, fixed-size matrix multiplication.
Matrix multiplication can easily become memory-bound if we naively re-fetch data from global memory to registers every time we perform a computation. The key observation is that many of the sub-inputs are shared between sub-problems: each tile $A_{ik}$ is needed by every output tile in the same row of $C$, and each tile $B_{kj}$ by every output tile in the same column. If we cache these tiles in faster memory, we can reuse them instead of fetching them from global memory again and again.
In CUDA, there are three types of user-accessible memory:
Memory type | Capacity | Bandwidth & latency |
---|---|---|
Global memory | Largest (on the order of GBs) | Lowest bandwidth, highest latency |
Shared memory | Small (tens to hundreds of KBs per SM) | Much higher bandwidth, much lower latency |
Registers | Smallest (a limited number per thread) | Highest bandwidth, lowest latency |
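In CUDA C++, the three memory types show up roughly like this (a trivial sketch; the kernel and buffer names are made up):

```cuda
__global__ void memory_types_demo(const float* in, float* out, int n) {
    // `in` and `out` point to global memory (allocated on the host with cudaMalloc).
    __shared__ float staging[256];                 // shared memory: per-threadblock

    int i = blockIdx.x * blockDim.x + threadIdx.x; // `i` and `x` live in registers
    float x = (i < n) ? in[i] : 0.0f;

    staging[threadIdx.x] = x;                      // registers -> shared memory
    __syncthreads();

    if (i < n) out[i] = staging[threadIdx.x];      // shared -> registers -> global
}
// Assumes a launch with 256 threads per block, e.g. memory_types_demo<<<grid, 256>>>(...).
```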
Here's a high-level view of how each memory type is utilised:
On the threadblock level, the problem is partitioned into sub-problems of a chosen threadblock tile size: each threadblock is responsible for one tile of the output $C$, iterating over the $K$ dimension one chunk at a time.
Redundant data movement is minimised by loading the sub-inputs (the corresponding tiles of $A$ and $B$) from global memory into shared memory once, so that every warp in the threadblock can reuse them without going back to global memory.
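To make the threadblock level concrete, here is a simplified sketch of shared-memory tiling using plain CUDA cores (no Tensor Cores, no warp-level tiling; `BT` is a made-up tile size and M, N, K are assumed divisible by it):

```cuda
constexpr int BT = 32;  // hypothetical threadblock tile size

// C (M x N) = A (M x K) * B (K x N), all row-major.
// Launch with dim3 block(BT, BT) and dim3 grid(N / BT, M / BT).
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[BT][BT];
    __shared__ float Bs[BT][BT];

    int row = blockIdx.y * BT + threadIdx.y;  // output row this thread computes
    int col = blockIdx.x * BT + threadIdx.x;  // output column this thread computes
    float acc = 0.0f;

    // March over the K dimension one tile at a time.
    for (int k0 = 0; k0 < K; k0 += BT) {
        // Each tile of A and B is loaded from global memory exactly once per threadblock...
        As[threadIdx.y][threadIdx.x] = A[row * K + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();

        // ...and then reused BT times from shared memory.
        for (int k = 0; k < BT; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;  // accumulator lives in a register until the very end
}
```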
On the warp level, the sub-problem is further partitioned into sub-sub-problems: each warp in the threadblock is responsible for a smaller tile of its threadblock's output.
Redundant data movement is minimised by loading the sub-inputs (the fragments of the shared-memory tiles that the warp needs) from shared memory into registers, where they are reused across the Tensor Core operations issued by that warp.
Note: Distributing data across registers
It is worth noting that registers are thread-level only. This means that inputs in a register cannot be accessed by other threads in a warp. The exact way the warp's inputs are distributed across the registers of its 32 threads is dictated by the Tensor Core instruction being used (the fragment layouts are documented in the PTX ISA).
To actually perform the matrix multiplication, we use the Tensor Cores on the GPU. My GPU (RTX 2060) has second-generation Tensor Cores, which are specialised to solve small fixed-size problems of the form

$$D = AB + C$$

where $A$, $B$, $C$ and $D$ are small matrices whose shapes are fixed by the instruction (for fp16 on my GPU, $M \times N \times K = 16 \times 8 \times 8$, as we will see below).
Note
Tensor Core operations are warp-level instructions, meaning that all the threads in a warp need to execute the Tensor Core instruction at the same time, collaboratively preparing the data to be consumed by one Tensor Core.
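To see what a warp-level Tensor Core operation looks like in code, here is a minimal sketch using CUDA's `nvcuda::wmma` API, a higher-level wrapper around the Tensor Core instructions. The $16 \times 16 \times 16$ fragment shape shown here is one of the shapes this API supports, chosen purely for illustration; it is not the same shape as the raw PTX instruction discussed below.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a single 16x16x16 fp16 tile.
// Launch with exactly 32 threads (one warp); all pointers are device pointers.
__global__ void wmma_tile(const half* A,   // 16x16, row-major
                          const half* B,   // 16x16, column-major
                          const half* C,   // 16x16, row-major
                          half* D) {       // 16x16, row-major
    // Fragments live in the registers of the warp's 32 threads; the exact
    // element-to-thread mapping is decided by the hardware/compiler.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // acc = A*B + acc

    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```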
So, given that we want to minimise data movement, we should just choose a partition size as large as possible to use all shared memory and registers, right? Well, not quite.
Asymptotically, as the problem size increases, yes, we do want to use as much shared memory and registers as possible. However, for small problem sizes, we might run into two problems: (1) there may be too few threadblocks and warps to keep all the SMs busy and to hide memory latency, and (2) a larger fraction of each (padded) partition is wasted when the problem size is not a multiple of the partition size.
A typical implementation might use a partition size of
In general, having a large warp partition size means there will be less redundant data movement, but at the cost of having fewer warps. Having too few warps means that we will not be able to hide the latency of memory accesses (because we might run out of other warps to schedule while the current warp is waiting for data).
A typical implementation might use a partition size of
This is completely determined by what instructions your GPU supports. For my RTX 2060, the PTX instruction for fp16 Tensor Core matrix multiplication (with fp16 accumulation) is `mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16`, which expects an $A$ of size $16 \times 8$, a $B$ of size $8 \times 8$, and a $C$/$D$ of size $16 \times 8$ (i.e. $M \times N \times K = 16 \times 8 \times 8$).
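For reference, this is roughly how such an instruction is wrapped in inline PTX from CUDA C++. This is a sketch: the register counts below match the m16n8k8 fp16 shape (each 32-bit register packs two fp16 values), but filling them correctly requires following the per-thread fragment layout in the PTX ISA documentation, which is omitted here.

```cuda
#include <cstdint>

// Warp-level m16n8k8 fp16 MMA: D (16x8) = A (16x8, row-major) * B (8x8, col-major) + C (16x8).
// a[2], b[1], c[2], d[2] hold packed pairs of fp16 values in each thread's registers,
// distributed across the warp according to the PTX ISA fragment layout.
__device__ void mma_m16n8k8_fp16(uint32_t d[2], const uint32_t a[2],
                                 const uint32_t b[1], const uint32_t c[2]) {
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3}, {%4}, {%5,%6};\n"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(c[0]), "r"(c[1]));
}
```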
The above techniques can get us close to the theoretical peak performance of the GPU when the problem size is large. However, for smaller problem sizes, they are not as efficient. There are two common techniques to further improve the performance of matrix multiplication: parallel-K reduction and software pipelining.
In cases where $M$ and $N$ are small, there might not be enough threadblocks (and hence warps) to keep all the SMs on the GPU busy.
In cases where $K$ is large, parallel-K reduction additionally splits the $K$ dimension across multiple threadblocks, so that each threadblock computes a partial product over its own slice of $K$
and accumulates the intermediate results into its own partial output.
The caveat is that now, we need to allocate more memory to store the results from each threadblock, and invoke a second kernel to perform a final reduction over the partial results to get the final output.
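A minimal sketch of the reduction side of parallel-K (the names and layout of the workspace are assumptions, and the partial-GEMM kernel that fills it is omitted):

```cuda
// Parallel-K (split-K) reduction sketch.
// `partial` holds `splits` consecutive M*N buffers, one per K-slice,
// each containing that slice's partial result for the whole output.
__global__ void reduce_partials(const float* partial, float* C,
                                int M, int N, int splits) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= M * N) return;

    // Final reduction: sum the partial results from every K-slice.
    float sum = 0.0f;
    for (int s = 0; s < splits; ++s)
        sum += partial[s * M * N + idx];
    C[idx] = sum;
}

// Host-side flow (sketch):
//   1. allocate splits * M * N floats of workspace for the partial results
//   2. launch the GEMM kernel with an extra grid dimension over K-slices,
//      each slice accumulating A[:, k0:k1] * B[k0:k1, :] into its own buffer
//   3. launch reduce_partials to sum the buffers into C
```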
Normally, CUDA hides the latency of memory accesses by scheduling other warps to execute while a warp is waiting for data. This requires us to have enough warps to mask the latency.
However, the number of warps is typically relatively small when doing GEMM. This is because the number of resident warps is limited by the shared memory and registers available on each SM: to minimise redundant data movement, each warp is given large tiles to work on, so each warp consumes a lot of registers and only a few warps can be resident at once.
The CUTLASS docs mention that "The accumulator elements typically occupy at least half a thread's total register budget".
To mitigate this effect, we can use software pipelining. In essence, we (manually) preload the inputs for the next iteration of the loop asynchronously using special instructions. While the inputs are being loaded, we can continue to compute on the current iteration. It is summarised by the following diagram:
This is made possible by the fact that the GPU is like any modern CPU: it can pipeline memory accesses and arithmetic operations as long as there is no data dependency between them. This is known as instruction-level parallelism.
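Here is a sketch of this pattern applied to the simple tiled kernel from earlier: each iteration first issues the global loads for the *next* tile into registers, then computes on the current shared-memory tile while those loads are in flight, and finally stores the prefetched values into the other half of a double buffer. (`BT` is again a made-up tile size and the dimensions are assumed divisible by it; real implementations pipeline Tensor Core fragments and, on Ampere and newer GPUs, can use asynchronous copies, which my Turing GPU lacks.)

```cuda
constexpr int BT = 32;  // hypothetical tile size

// Launch with dim3 block(BT, BT) and dim3 grid(N / BT, M / BT).
__global__ void sgemm_pipelined(const float* A, const float* B, float* C,
                                int M, int N, int K) {
    // Two shared-memory buffers: compute on one while the other is being filled.
    __shared__ float As[2][BT][BT];
    __shared__ float Bs[2][BT][BT];

    int row = blockIdx.y * BT + threadIdx.y;
    int col = blockIdx.x * BT + threadIdx.x;
    float acc = 0.0f;

    // Prologue: load the first pair of tiles into buffer 0.
    As[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    for (int k0 = 0; k0 < K; k0 += BT) {
        int cur = (k0 / BT) & 1, nxt = cur ^ 1;

        // 1. Issue global loads for the next tile; they have no data dependency
        //    on the current tile, so they overlap with the math below.
        float a_next = 0.0f, b_next = 0.0f;
        if (k0 + BT < K) {
            a_next = A[row * K + (k0 + BT + threadIdx.x)];
            b_next = B[(k0 + BT + threadIdx.y) * N + col];
        }

        // 2. Compute on the current tile while the loads are in flight.
        for (int k = 0; k < BT; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];

        // 3. Stash the prefetched values into the other buffer for the next iteration.
        if (k0 + BT < K) {
            As[nxt][threadIdx.y][threadIdx.x] = a_next;
            Bs[nxt][threadIdx.y][threadIdx.x] = b_next;
        }
        __syncthreads();  // next buffer must be fully written before it is read
    }
    C[row * N + col] = acc;
}
```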
If you want to see how all these concepts come together in a real implementation, check out my implementation of training MNIST from scratch with CUDA, where I trained a multi-layer perceptron and achieved a 6x speedup over optimised PyTorch for medium-sized networks.