Bibliography (3):

  1. CLEVRER: CoLlision Events for Video REpresentation and Reasoning

  2. CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

  3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding