Bibliography (3):

CLEVRER: CoLlision Events for Video REpresentation and Reasoning
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding