“Blender: A State-Of-The-Art Open Source Chatbot”, Stephen Roller, Jason Weston, Emily Dinan (2020-04-29):

…This is the first time a chatbot has learned to blend several conversational skills—including the ability to assume a persona, discuss nearly any topic, and show empathy—in natural, 14-turn conversation flows. Today we’re sharing new details of the key ingredients that we used to create our new chatbot…Our new recipe incorporates not just large-scale neural models, with up to 9.4 billion parameters—or 3.6× more than the largest existing system—but also equally important techniques for blending skills and detailed generation…We trained on previously available public domain conversations, comprising 1.5 billion extracted training examples. Our neural networks are too large to fit on a single device, so we used techniques such as column-wise model parallelism, which allows us to split the neural network into smaller, more manageable pieces while maintaining maximum efficiency. Such careful organization of our neural networks enabled us to handle larger networks than we could previously while maintaining the high efficiency needed to scale to terabyte-size data sets.
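The column-wise parallelism mentioned above can be illustrated with a minimal sketch: a linear layer's weight matrix is split by columns, each shard computes its slice of the output independently, and the slices are concatenated at the end. This toy version uses NumPy arrays in place of real devices; the shard count and dimensions are illustrative assumptions, not Blender's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_shards = 8, 12, 3

W = rng.standard_normal((d_in, d_out))   # full weight matrix of one linear layer
x = rng.standard_normal((1, d_in))       # one input activation

# Split W column-wise: each "device" holds d_out / n_shards columns.
shards = np.split(W, n_shards, axis=1)

# Each device computes its slice of the output independently,
# with no communication needed during the matmul itself...
partials = [x @ W_i for W_i in shards]

# ...and the slices are concatenated (an all-gather in a real system).
y_parallel = np.concatenate(partials, axis=1)

# The sharded computation matches the single-device result.
assert np.allclose(y_parallel, x @ W)
```

In a real implementation each shard would live on a separate GPU, so the per-device memory for this layer drops by roughly a factor of `n_shards` while the result stays mathematically identical.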

…However, to make sure conversational agents don’t repeat themselves or display other shortcomings, researchers typically use a number of possible generation strategies after the model is trained, including beam search, next token sampling, and n-gram blocking. We find that the length of the agent’s utterances is important in achieving better results with human evaluators. If they’re too short, the responses are dull and communicate a lack of interest; if they’re too long, the chatbot seems to waffle and not listen. Contrary to recent research, which finds that sampling outperforms beam search, we show that a careful choice of search hyperparameters can give strong results by controlling this trade-off. In particular, tuning the minimum beam length gives important control over the “dull versus spicy” spectrum of responses.

In this graph, we show how often human evaluators preferred our chatbots to human-to-human chats over time. Since 2018, we’ve improved model performance in this evaluation—from 23% in 2018 to 49% today.