Bibliography (19):

  1. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

  2. PaLI: A Jointly-Scaled Multilingual Language-Image Model

  3. ImageNet: A Large-Scale Hierarchical Image Database

  4. LiT: Zero-Shot Transfer with Locked-image Text Tuning

  5. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

  6. Scaling Vision Transformers

  7. A domain-specific supercomputer for training deep neural networks

  8. The Efficiency Misnomer

  9. PaLM: Scaling Language Modeling with Pathways

  10. Efficiently Scaling Transformer Inference

  11. Partial success in closing the gap between human and machine vision

  12. https://arxiv.org/pdf/2302.05442.pdf#page=40&org=google

  13. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

  14. Revisiting the Calibration of Modern Neural Networks

  15. https://arxiv.org/pdf/2302.05442.pdf#page=14&org=google

  16. https://arxiv.org/pdf/2302.05442.pdf#page=37&org=google

  17. Distilling the Knowledge in a Neural Network