"Learning Cooperative Visual Dialog Agents With Deep Reinforcement Learning", 2017-03-20 ():
We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative "image guessing" game between two agents, Qbot and Abot, who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end, from pixels to multi-agent multi-round dialog to game reward.
We demonstrate two experimental results.
First, as a "sanity check" demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in an ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that the two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the emergence of grounded language and communication among "visual" dialog agents with no human supervision.
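To make the emergent protocol concrete, here is a minimal sketch of the synthetic world. The attribute values, the hand-fixed symbol-to-attribute mapping, and the function names are illustrative assumptions, standing in for a mapping the agents would discover through RL rather than be given:

```python
import itertools
import random

# Toy synthetic world: each object is a (shape, color, style) triple,
# and Qbot must identify Abot's hidden target from the pool.
SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "green", "blue"]
STYLES = ["solid", "dashed", "dotted"]
POOL = list(itertools.product(SHAPES, COLORS, STYLES))

# Assumed converged protocol: symbol X came to ask about shape,
# Y about color, Z about style (the symbols start out meaningless).
SYMBOL_TO_ATTR = {"X": 0, "Y": 1, "Z": 2}

def abot_answer(target, symbol):
    """Abot sees the target and answers the attribute the symbol asks about."""
    return target[SYMBOL_TO_ATTR[symbol]]

def qbot_guess(pool, dialog):
    """Qbot keeps only objects consistent with every (symbol, answer) pair."""
    candidates = [obj for obj in pool
                  if all(obj[SYMBOL_TO_ATTR[s]] == a for s, a in dialog)]
    return random.choice(candidates)

target = ("square", "blue", "dotted")
dialog = [(s, abot_answer(target, s)) for s in ("X", "Y", "Z")]
print(qbot_guess(POOL, dialog))  # three rounds pin down the target exactly
```

With all three attributes asked, exactly one candidate survives, which is why a shared three-symbol protocol suffices for this world.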
Second, we conduct large-scale real-image experiments on the VisDial dataset, where we pretrain with supervised dialog data and show that the RL "fine-tuned" agents outperform supervised learning (SL) agents.
Interestingly, the RL Qbot learns to ask questions that Abot is good at, ultimately resulting in more informative dialog and a better team.
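A hedged sketch of the kind of per-round reward that drives this fine-tuning: the team is rewarded when, after a question-answer round, Qbot's running estimate of the target image embedding moves closer to the true one. The variable names and the use of Euclidean distance here are illustrative assumptions:

```python
import numpy as np

def round_reward(y_true, y_prev, y_curr):
    """Positive when the new estimate y_curr is nearer to the ground-truth
    embedding y_true than the previous estimate y_prev was."""
    return float(np.linalg.norm(y_true - y_prev)
                 - np.linalg.norm(y_true - y_curr))

y_true = np.array([1.0, 0.0, 0.0])   # ground-truth image embedding
y_prev = np.array([0.0, 1.0, 0.0])   # Qbot's estimate before the round
y_curr = np.array([0.5, 0.5, 0.0])   # estimate after an informative answer
print(round_reward(y_true, y_prev, y_curr) > 0)  # informative round is rewarded
```

A reward of this shape is shared by both agents, which is what aligns Qbot's questioning policy with what Abot answers well.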