“NNUE: The Neural Network of the Stockfish Chess Engine”, 2021-01-08:
The real cleverness of Stockfish’s neural network is that it’s an efficiently-updatable neural network (NNUE). Specifically, it’s a simple feedforward network with:
a large (10.5M parameters!) input layer, illustrated below, that can use two different levels of sparsity for computational efficiency;
three much smaller layers (with 17.5k parameters in total) which are evaluated densely using vector instructions;
a single scalar output to give a numerical score for the position, indicating how favourable it is for the player about to move.
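The small dense tail of the network can be sketched as follows. This is a toy illustration, not Stockfish's code: the layer sizes (two 256-element accumulator halves concatenated, then 32 → 32 → 1) and the clipped-ReLU activation are assumptions based on the published NNUE architecture, and the engine's fixed-point scaling shifts after each layer are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
# 8-bit weights for the three small dense layers (random stand-ins here).
W1 = rng.integers(-128, 128, size=(512, 32), dtype=np.int8)
W2 = rng.integers(-128, 128, size=(32, 32), dtype=np.int8)
W3 = rng.integers(-128, 128, size=(32, 1), dtype=np.int8)

# 512*32 + 32*32 + 32*1 = 17,440 weights -- the "17.5k parameters in total".
assert W1.size + W2.size + W3.size == 17440

def clipped_relu(x: np.ndarray) -> np.ndarray:
    # Activations are clamped to [0, 127] so they fit back into 8 bits.
    return np.clip(x, 0, 127).astype(np.int8)

def evaluate(acc_us: np.ndarray, acc_them: np.ndarray) -> int:
    """Densely evaluate the two 256-element first-layer accumulators
    (side-to-move half first), returning a single scalar score."""
    x = clipped_relu(np.concatenate([acc_us, acc_them]))
    x = clipped_relu(x.astype(np.int32) @ W1.astype(np.int32))
    x = clipped_relu(x.astype(np.int32) @ W2.astype(np.int32))
    return int((x.astype(np.int32) @ W3.astype(np.int32)).item())

score = evaluate(rng.integers(0, 128, size=256, dtype=np.int16),
                 rng.integers(0, 128, size=256, dtype=np.int16))
```

In the real engine these three layers are evaluated with SIMD integer instructions on every node, which is why they are kept so small.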
Everything is done using integer arithmetic, with 16-bit weights in the first layer and 8-bit weights in the remaining layers… The inputs to the first layer are two sparse binary arrays, each consisting of 41,024 elements. It may seem highly redundant to encode a chess position using 82,048 binary features, but this is similar to an approach (called ‘feature crosses’) used in recommender systems.
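As a concrete illustration of the feature cross, here is a sketch of a HalfKP-style encoding that yields 64 × (10×64 + 1) = 41,024 binary features per side: one feature per (own-king square, piece kind, piece square) triple over the ten non-king piece kinds. The exact index layout below is an assumption; Stockfish's real code differs in details.

```python
PIECE_KINDS = 10  # {pawn, knight, bishop, rook, queen} x {friendly, enemy}

def halfkp_index(king_sq: int, piece_kind: int, piece_sq: int) -> int:
    """Index of the binary feature that is 1 when `piece_kind` stands on
    `piece_sq` while our king stands on `king_sq` (hypothetical layout)."""
    # The '+1' slot per king square is a legacy 'no piece' feature,
    # which is how 64 * 641 = 41,024 arises.
    return king_sq * (PIECE_KINDS * 64 + 1) + 1 + piece_kind * 64 + piece_sq

def active_features(king_sq: int, pieces: list[tuple[int, int]]) -> list[int]:
    """Sparse encoding: the indices of the nonzero entries only."""
    return [halfkp_index(king_sq, kind, sq) for kind, sq in pieces]

# Toy position: our king on e1 (square 4), a friendly pawn on e2 (square 12),
# an enemy rook on a8 (square 56). Only 2 of the 41,024 entries are nonzero.
feats = active_features(4, [(0, 12), (7, 56)])
assert len(feats) == 2
assert all(0 <= f < 64 * (10 * 64 + 1) for f in feats)
```

Crossing the king square with every (piece, square) pair is what makes the encoding so redundant, and so large, while each individual position activates at most 30 features per side.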
…There are two levels of sparsity which are used when computing this affine transformation from ℝ^41,024 to ℝ^256, allowing the network to be efficiently evaluated many times in a tree search:
the 41,024-element implicit vectors are themselves sparse: the number of nonzero elements is equal to the number of non-king pieces on the board.
moving a piece typically changes very few of the entries of the vector: if it’s a regular non-king move, only 2 entries change; if it’s a non-king move with capture, then 3 entries change.
It’s this second aspect which warrants the name ‘efficiently updatable’: when a move is made (or unmade, since we’re doing a tree search), we only need to add/subtract a few 256-element matrix columns to/from the cached ‘dense worldview’ to update it.
Unless a king is moved, this (2 or 3 vector additions/subtractions) beats summing all of the matrix columns corresponding to nonzero entries (up to 30 vector additions), which in turn unconditionally beats doing a regular dense matrix-vector multiplication (41,024 vector additions). That is to say, the second-level sparsity is about 10× more efficient than the first-level sparsity, which is in turn about 1000× more efficient than naively doing a dense matrix-vector multiplication.
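The two sparsity levels can be sketched like this; a toy NumPy version of the first-layer affine map, not the engine's hand-vectorised int16 code:

```python
import numpy as np

IN, OUT = 41024, 256
rng = np.random.default_rng(0)
# First-layer weights: one 256-element column per binary input feature.
W = rng.integers(-64, 64, size=(IN, OUT), dtype=np.int16)
b = rng.integers(-64, 64, size=OUT, dtype=np.int16)

def accumulate(active: list[int]) -> np.ndarray:
    """First-level sparsity: sum one column per active feature
    (up to ~30 additions) instead of a dense 41,024-column product."""
    return b + W[active].sum(axis=0, dtype=np.int16)

def update(acc: np.ndarray, removed: list[int], added: list[int]) -> np.ndarray:
    """Second-level sparsity: on a move, subtract the columns of features
    that turned off and add those that turned on (2-3 columns)."""
    return acc - W[removed].sum(axis=0, dtype=np.int16) \
               + W[added].sum(axis=0, dtype=np.int16)

# A quiet move changes 2 features: the piece leaves one square, arrives at another.
before = [2577, 3069, 12345]
after  = [2577, 3069, 20000]   # feature 12345 switches off, 20000 switches on
acc  = accumulate(before)
acc2 = update(acc, removed=[12345], added=[20000])
assert np.array_equal(acc2, accumulate(after))
```

The incremental `update` touches 2-3 columns where `accumulate` touches up to 30 and a dense product would touch all 41,024, matching the ~10× and ~1000× factors above.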