“Real-Time Neural Radiance Caching for Path Tracing”, Thomas Müller, Fabrice Rousselle, Jan Novák, Alexander Keller, 2021-06-23:

We present a real-time neural radiance caching method for path-traced global illumination. Our system is designed to handle fully dynamic scenes, and makes no assumptions about lighting, geometry, or materials.

The data-driven nature of our approach sidesteps many difficulties of caching algorithms, such as locating, interpolating, and updating cache points. Since pretraining neural networks to handle novel, dynamic scenes is a formidable generalization challenge, we do away with pretraining and instead achieve generalization via adaptation, i.e., we train the radiance cache while rendering. We employ self-training to provide low-noise training targets and simulate infinite-bounce transport by merely iterating few-bounce training updates.

The updates and cache queries incur a mild overhead—about 2.6 ms at full HD resolution—thanks to a streaming implementation of the neural network that fully exploits modern hardware.

We demonstrate noise reduction at the cost of little induced bias, and report state-of-the-art, real-time performance on a number of challenging scenarios.

Figure 7: Our fully fused neural network outperforms an equivalent XLA-enabled TensorFlow (v2.5.0) implementation. Both implementations use half-precision floating-point numbers and TensorCore hardware for matrix multiplication. We compare the throughput of training (left) and inference (right) for a 64-neuron-wide (solid line) and a 128-neuron-wide (dashed line) multi-layer perceptron. The batch sizes relevant to our goal of neural radiance caching are small training batches (e.g. 2^14 elements) and large inference batches (e.g. 2^21 elements for evaluating a 1920 × 1080 frame). For these batch sizes, the speed-up over TensorFlow ranges from 5× to 10×.

4. Fully Fused Neural Networks: We implemented our neural network from scratch in a GPU programming language to take full advantage of the GPU memory hierarchy. In Figure 7, we compare the performance of this implementation to TensorFlow (v2.5.0) [Abadi et al. 2015], which we outperform by almost an order of magnitude. [cf. Persistent RNNs, QRNNs]

To understand where this dramatic speedup comes from, we examine the bottleneck of evaluating a fully-connected neural network like ours. The computational cost of such a neural network scales quadratically with its width, whereas its memory traffic scales linearly. Modern GPUs, however, have vastly more computational throughput than memory bandwidth, meaning that for narrow neural networks like ours the linear memory traffic is the bottleneck. The key to improving performance is thus to minimize traffic to slow “global” memory (VRAM and high-level caches) and to fully utilize fast on-chip memory (low-level caches, “shared” memory, and registers).
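The compute-versus-bandwidth argument can be made concrete with a back-of-envelope calculation. The sketch below estimates the arithmetic intensity of one fully connected layer; the hardware figures in the comments are our own illustrative assumptions, not measurements from the paper.

```python
# Back-of-envelope arithmetic intensity for one fully connected layer,
# illustrating why narrow MLPs are memory-bound when activations and
# weights travel through global memory.
def layer_arithmetic_intensity(width, batch, bytes_per_value=2):
    """FLOPs per byte of global-memory traffic for one W @ H layer,
    assuming weights and activations are read from / written to VRAM."""
    flops = 2 * width * width * batch            # one multiply-add per term
    traffic = bytes_per_value * (width * width   # weight matrix
                                 + 2 * width * batch)  # read + write activations
    return flops / traffic

# A 64-wide layer with a 128-element batch chunk (half precision):
ai = layer_arithmetic_intensity(64, 128)
print(f"arithmetic intensity: {ai:.1f} FLOPs/byte")  # -> 25.6 FLOPs/byte

# For comparison (illustrative figures): an RTX 3090 sustains on the
# order of 100+ FP16 TFLOPS against roughly 1 TB/s of bandwidth, so an
# arithmetic intensity above ~100 FLOPs/byte is needed to be
# compute-bound -- far more than a narrow layer achieves from VRAM.
```

This is why keeping intermediate activations on chip, rather than shrinking the FLOP count, is the decisive optimization for such networks.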

Our fully fused approach does precisely this: we implement the entire neural network as a single GPU kernel that is designed such that the only slow global memory accesses are reading and writing the network inputs and outputs. Furthermore, implementing the kernel from scratch as opposed to building it out of existing frameworks allows us to specifically tailor the implementation to the network architecture and the GPU that we use.

Figure 6 illustrates how the fully fused approach is mapped to the memory hierarchy. Using CUDA terminology: a given batch of input vectors is partitioned into block-column segments, each processed by a single thread block (Figure 6b). The thread blocks independently evaluate the network by alternating between weight-matrix multiplication and element-wise application of the activation function. By making the thread blocks small enough that all intermediate neuron activations fit into on-chip shared memory, traffic to slow global memory is minimized. This is the key advantage of the fully fused approach and stands in contrast to typical implementations of general matrix multiplication.
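The evaluation order can be sketched in NumPy as a simplified model of the kernel (not the actual CUDA implementation): the batch is split into chunks, and each chunk is pushed through all layers before the next chunk starts, mirroring one thread block keeping its chunk's activations resident in shared memory. The ReLU activation and layer count here are illustrative.

```python
import numpy as np

def fused_forward(weights, X, chunk=128):
    """weights: list of (width, width) matrices; X: (width, batch) inputs.
    Each chunk of the batch traverses the whole network before the next
    chunk is touched -- only the input load and output store would hit
    global memory in the real kernel."""
    out = np.empty_like(X)
    for start in range(0, X.shape[1], chunk):
        H = X[:, start:start + chunk]       # read inputs (global memory)
        for W in weights:                   # intermediates stay "on chip"
            H = np.maximum(W @ H, 0.0)      # matmul + ReLU activation
        out[:, start:start + chunk] = H     # write outputs (global memory)
    return out

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((64, 64)) * 0.1 for _ in range(5)]
X = rng.standard_normal((64, 512))
Y = fused_forward(Ws, X)

# The chunked order is numerically identical to an unfused whole-batch pass:
H = X
for W in Ws:
    H = np.maximum(W @ H, 0.0)
assert np.allclose(Y, H)
```

The point of the reordering is purely about data movement: the result is unchanged, but intermediate activations never need to exist for the whole batch at once.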

Within a matrix multiplication (Figure 6c), each warp of the thread block computes the matrix product of a single block-row (striped area). In our case, the striped weights in Wi are few enough to fit into the registers of the warp and can thus be re-used for every block of Hi+1 that the warp computes, yielding an additional performance gain. Furthermore, since each warp loads a distinct block-row of the weight matrix, the entire thread block loads the weight matrix from global memory exactly once, which cannot be reduced further.
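The per-warp decomposition can be verified with a small NumPy sketch: each "warp" owns one block-row of the weight matrix (the 16-row split below is an illustrative assumption), computes its block-row product, and the stacked results tile the full product exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))    # weight matrix W_i
H = rng.standard_normal((64, 128))   # one thread block's chunk of H_i

rows_per_warp = 16  # illustrative block-row height per warp
partial = [W[r:r + rows_per_warp] @ H        # each warp: one block-row product,
           for r in range(0, 64, rows_per_warp)]  # its rows held in registers
H_next = np.vstack(partial)                  # warps' outputs tile H_{i+1}

assert np.allclose(H_next, W @ H)            # identical to the full product
```

Because each warp touches a disjoint block-row of W, the thread block as a whole reads every weight exactly once, which is the minimum possible for one matrix product.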

The only remaining reduction of global memory traffic is thus to minimize the number of thread blocks by making each as large as shared memory allows. On our hardware (NVIDIA RTX 3090) and with our 64-neuron-wide network, this sweet spot is reached when each thread block processes 128 elements of the batch. Each thread block thus computes matrix products of a 64 × 64 weight matrix with a 64 × 128 chunk of the data.
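A rough shared-memory budget shows that this chunk size is comfortably feasible. The arithmetic below is illustrative; the double-buffering assumption (holding a layer's input and output chunk simultaneously) is ours, not a figure from the paper.

```python
# Illustrative shared-memory footprint of one thread block's activations
# for the 64-wide network with half-precision (2-byte) values.
width = 64            # neurons per layer
chunk = 128           # batch elements per thread block
bytes_fp16 = 2

activations = width * chunk * bytes_fp16  # one chunk of activations
ping_pong = 2 * activations               # assumed input+output buffers
print(activations, ping_pong)             # 16384 32768 (16 KiB / 32 KiB)
```

At 16–32 KiB per block, several such blocks can still reside on one streaming multiprocessor, so enlarging the chunk trades occupancy against per-block weight reuse, which is why the optimum sits at a hardware-dependent sweet spot rather than at the maximum size.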