Attention Is All You Need (Vaswani et al., 2017)
Transformers Learn Shortcuts to Automata (Liu et al., 2023)
Neural Networks and the Chomsky Hierarchy (Delétang et al., 2023)
Masked Hard-Attention Transformers and Boolean RASP Recognize Exactly the Star-Free Languages (Angluin, Chiang & Yang, 2023)
Show Your Work: Scratchpads for Intermediate Computation with Language Models (Nye et al., 2021)
Exploring Length Generalization in Large Language Models (Anil et al., 2022)
Transformers Can Achieve Length Generalization But Not Robustly (Zhou et al., 2024):
https://arxiv.org/pdf/2402.09963.pdf#page=2
https://arxiv.org/pdf/2402.09963.pdf#page=27
https://arxiv.org/pdf/2402.09963.pdf#page=34
Sensitivity as a Complexity Measure for Sequence Classification Tasks (Hahn, Jurafsky & Futrell, 2021)