The growing application of deep neural networks in safety-critical domains makes it vitally important to analyse the faults that occur in such systems. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks (TensorFlow, Keras and PyTorch), as well as from related Stack Overflow posts. Structured interviews with 20 researchers and practitioners, who described the problems encountered in their experience, enriched our taxonomy with a variety of additional faults that did not emerge from the other two sources. Our final taxonomy was validated with a survey involving an additional set of 21 developers, which confirmed that almost all fault categories (13–15) were experienced by at least 50% of the survey participants.
Layer Properties:
wrong input sample size
wrongly defined input shape
wrongly defined output shape
wrongly defined input and output shapes
wrong filter size in convolutional layer
missing bias
wrong number of neurons in layer
wrong amount & type of pooling in convolutional layer
layer dimension mismatch
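To make this category concrete, the following sketch (hypothetical code, not drawn from the analysed artefacts) shows how a layer dimension mismatch typically surfaces: a layer is defined for one input width while the data provides another. The `dense_forward` helper and all shapes are illustrative assumptions.

```python
import numpy as np

# Minimal dense layer, only to illustrate the shape check that real
# frameworks perform internally.
def dense_forward(x, weights, bias):
    if x.shape[1] != weights.shape[0]:
        raise ValueError(
            f"layer dimension mismatch: input has {x.shape[1]} features, "
            f"layer expects {weights.shape[0]}"
        )
    return x @ weights + bias

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 10))   # layer defined for 64-dim inputs
bias = np.zeros(10)
batch = rng.normal(size=(8, 32))      # actual samples are 32-dim: the fault

try:
    dense_forward(batch, weights, bias)
except ValueError as e:
    print(e)
```

In Keras or PyTorch the same mistake produces a shape-mismatch error at graph construction or at the first forward pass.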
Model Type & Properties:
wrong model initialization
wrong weight initialization
multiple initializations of CNN
wrong selection of model
wrong network architecture
suboptimal network structure
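As an illustration of the wrong-weight-initialization fault above, the hypothetical NumPy sketch below shows why an all-zero initialization is harmful: every hidden unit computes the same output, so gradients are identical and the units never differentiate. The layer sizes and the Xavier-style scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))           # a small batch of 3-dim inputs

# Fault: all-zero weights -> every activation is exactly 0.0,
# so all units stay interchangeable ("symmetry" is never broken).
w_zero = np.zeros((3, 5))
hidden = np.tanh(x @ w_zero)
print(np.unique(hidden))              # a single value: 0.0

# Common remedy: small random weights (here a Xavier-style scale).
w_ok = rng.normal(scale=np.sqrt(1 / 3), size=(3, 5))
hidden_ok = np.tanh(x @ w_ok)
```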
GPU Usage:
missing destination GPU device
incorrect state sharing
wrong reference to GPU device
wrong data parallelism on GPUs
calling unsupported operations on CUDA tensors
conversion to CUDA tensor inside the training/test loop
wrongly implemented data transfer function (CPU-GPU)
missing transfer of data to GPU
wrong tensor transfer to GPU
GPU tensor is used instead of CPU tensor
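Several of the GPU faults above (missing transfer of data to GPU, GPU tensor used instead of CPU tensor) reduce to operands living on different devices. Since a real GPU is needed to reproduce them directly, the sketch below uses a toy `Tensor` class that only mimics the device check frameworks such as PyTorch perform; the class, `matmul`, and device names are all hypothetical.

```python
class Tensor:
    """Toy stand-in for a framework tensor, tracking only its device."""
    def __init__(self, data, device="cpu"):
        self.data = data
        self.device = device

    def to(self, device):
        return Tensor(self.data, device)

def matmul(a, b):
    # Real frameworks raise a RuntimeError when operands live on
    # different devices; we mimic that check here.
    if a.device != b.device:
        raise RuntimeError(
            f"expected tensors on the same device, got {a.device} and {b.device}"
        )
    return Tensor([[sum(x * y for x, y in zip(row, col))
                    for col in zip(*b.data)] for row in a.data], a.device)

model_weights = Tensor([[1, 0], [0, 1]]).to("cuda:0")  # model moved to GPU
batch = Tensor([[2, 3]])                               # data left on CPU: the fault

try:
    matmul(batch, model_weights)
except RuntimeError as e:
    print(e)

out = matmul(batch.to("cuda:0"), model_weights)        # fix: transfer data too
```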
API:
deprecated API
wrong use of image decoding API
wrong position of data shuffle operation
missing global variables initialization
wrong API usage
missing API call
wrong reference to operational graph
wrong usage of placeholder restoration API
missing argument scoping
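The "wrong position of data shuffle operation" fault listed above can be illustrated with a hypothetical, framework-free pipeline: if data sorted by label is split before shuffling, the held-out portion contains only one class. The dataset and split sizes are illustrative assumptions.

```python
import random

# Data sorted by label: 50 samples of class 0, then 50 of class 1.
data = [(i, 0) for i in range(50)] + [(i, 1) for i in range(50)]

# Fault: split first, shuffle only the training part afterwards.
train, test = data[:80], data[80:]
random.shuffle(train)
test_labels = {label for _, label in test}
print(test_labels)                    # only class 1 reaches the test set

# Fix: shuffle the full dataset before splitting.
random.seed(42)
shuffled = data[:]
random.shuffle(shuffled)
train_ok, test_ok = shuffled[:80], shuffled[80:]
```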
Training:
Training Data Quality:
wrong labels for training data
wrong selection of features
unbalanced training data
not enough training data
low quality training data
overlapping output classes in training data
too many output categories
small range of values for a feature
discarding important features
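The unbalanced-training-data fault above is often caught with a simple class-frequency check before training. The helper below is a hypothetical sketch of such a check, not code from the analysed projects.

```python
from collections import Counter

def class_imbalance_ratio(labels):
    """Ratio of the most to the least frequent class (1.0 == balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

labels = ["cat"] * 900 + ["dog"] * 100
print(class_imbalance_ratio(labels))  # 9.0 -> strongly unbalanced
```

A large ratio typically calls for resampling, class weights, or collecting more minority-class data.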
Training Process:
wrong management of memory resources
reference to non-existing checkpoint
model too big to fit into available memory
missing data augmentation
redundant data augmentation
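For the "model too big to fit into available memory" fault, a back-of-envelope parameter-memory estimate before training can flag the problem early. The helper below is a hypothetical sketch for a stack of dense layers; it deliberately ignores activations and optimizer state (Adam, for instance, keeps two extra moment buffers per parameter, roughly tripling the figure).

```python
def model_memory_mb(layer_shapes, bytes_per_param=4):
    """Rough float32 memory estimate for a stack of dense layers.

    layer_shapes: list of (n_in, n_out) pairs; each layer stores
    n_in * n_out weights plus n_out biases.
    """
    params = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)
    return params * bytes_per_param / 2**20

print(model_memory_mb([(784, 512), (512, 10)]))  # a small MNIST-style MLP
```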
Optimizer:
wrong optimization function
epsilon for Adam optimizer too low
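The "epsilon for Adam optimizer too low" fault can be seen in a single toy update step: when the second-moment estimate is tiny, a vanishingly small gradient still yields a near-full-size step unless epsilon keeps the denominator away from zero. The single-parameter `adam_step` below is an illustrative sketch of the standard Adam update, with hypothetical gradient and epsilon values.

```python
import math

def adam_step(grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """One bias-corrected Adam update for a single parameter."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

tiny_grad = 1e-12
step_small_eps = adam_step(tiny_grad, 0.0, 0.0, eps=1e-16)  # ~lr-sized step
step_safe_eps  = adam_step(tiny_grad, 0.0, 0.0, eps=1e-3)   # ~grad-sized step
print(step_small_eps, step_safe_eps)
```

With epsilon too low, noise-level gradients are amplified to full learning-rate steps, which can destabilize training.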
Loss Function:
wrong loss function calculation
missing masking of invalid values to zero
wrong selection of loss function
missing loss function
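The "missing masking of invalid values to zero" fault above can be sketched with a padded regression target: without a mask, a single padding NaN corrupts the whole loss. The `masked_mse` helper and the example values are hypothetical.

```python
import numpy as np

def masked_mse(pred, target, mask):
    # Zero out invalid (e.g. padded) positions before averaging;
    # np.where avoids propagating NaNs from the masked entries.
    diff = np.where(mask > 0, pred - target, 0.0)
    return (diff ** 2).sum() / mask.sum()

pred   = np.array([1.0, 2.0, 3.0, 0.0])
target = np.array([1.0, 1.0, 3.0, np.nan])  # last entry is padding
mask   = np.array([1.0, 1.0, 1.0, 0.0])

naive = ((pred - target) ** 2).mean()       # fault: loss becomes NaN
print(naive, masked_mse(pred, target, mask))
```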