Bibliography (3):
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding