Facebook AI has built and open-sourced Blender, the largest-ever open-domain chatbot. According to human evaluators, it outperforms other chatbots in engagement and also feels more human.
The culmination of years of research in conversational AI, this is the first chatbot to blend a diverse set of conversational skills—including empathy, knowledge, and personality—together in one system.
We achieved this milestone through a new chatbot recipe that includes improved decoding techniques, novel blending of skills, and a model with 9.4 billion parameters, 3.6× as many as the largest existing system.
Today we’re releasing the complete model, code, and evaluation set-up, so that other AI researchers will be able to reproduce this work and continue to advance conversational AI research.
…This is the first time a chatbot has learned to blend several conversational skills—including the ability to assume a persona, discuss nearly any topic, and show empathy—in natural, 14-turn conversation flows. Today we’re sharing new details of the key ingredients that we used to create our new chatbot…

…Our new recipe incorporates not just large-scale neural models, with up to 9.4 billion parameters—or 3.6× as many as the largest existing system—but also equally important techniques for blending skills and detailed generation…

…We used previously available public-domain conversations comprising 1.5 billion training examples. Our neural networks are too large to fit on a single device, so we used techniques such as column-wise model parallelism, which allows us to split the neural network into smaller, more manageable pieces while maintaining maximum efficiency. This careful organization enabled us to handle larger networks than we could previously while maintaining the high efficiency needed to scale to terabyte-size data sets.
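Column-wise model parallelism can be illustrated with a toy example. The sketch below is hypothetical (plain NumPy, not the released training code): a linear layer's weight matrix is split along its output columns so that each shard could live on a separate device, each shard computes its slice of the output independently, and the per-shard outputs are concatenated to recover the full result.

```python
import numpy as np

# Hypothetical sketch of column-wise model parallelism for one linear layer.
# (Illustration only; shapes and the 3-"device" split are assumptions.)
rng = np.random.default_rng(0)
d_in, d_out, n_devices = 8, 12, 3

W = rng.standard_normal((d_in, d_out))   # full weight matrix
x = rng.standard_normal((1, d_in))       # one input activation

# Reference: the whole layer computed on a single device.
y_full = x @ W

# Split W column-wise into one shard per device: each is (d_in, d_out // n_devices).
shards = np.split(W, n_devices, axis=1)

# Each device multiplies the (replicated) input by only its own shard...
partials = [x @ shard for shard in shards]

# ...and concatenating the partial outputs recovers the full layer output.
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_full, y_parallel)
```

Because each device needs only its slice of the weights, the memory per device shrinks with the number of shards, which is what lets a 9.4B-parameter model that cannot fit on one device be trained across several.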
…However, to make sure conversational agents don’t repeat themselves or display other shortcomings, researchers typically use a number of possible generation strategies after the model is trained, including beam search, next token sampling, and n-gram blocking. We find that the length of the agent’s utterances is important in achieving better results with human evaluators. If they’re too short, the responses are dull and communicate a lack of interest; if they’re too long, the chatbot seems to waffle and not listen. Contrary to recent research, which finds that sampling outperforms beam search, we show that a careful choice of search hyperparameters can give strong results by controlling this trade-off. In particular, tuning the minimum beam length gives important control over the “dull versus spicy” spectrum of responses.
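The two decoding controls above, a minimum length that forbids ending the utterance too early and n-gram blocking that prevents repetition, can both be expressed as constraints on the logits at each generation step. The sketch below is a simplified illustration, not the released code; the `EOS` token id, the dict-of-logits interface, and the default `n=3` are assumptions.

```python
import math

EOS = 0  # hypothetical end-of-sequence token id

def blocked_tokens(tokens, n=3):
    """Return the next tokens that would repeat an n-gram already in `tokens`."""
    if len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[-(n - 1):])          # the (n-1)-gram just generated
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])      # token that would complete a seen n-gram
    return banned

def constrain_logits(logits, tokens, min_length, n=3):
    """Apply a minimum-length constraint and n-gram blocking to {token_id: logit}."""
    out = dict(logits)
    if len(tokens) < min_length:
        out[EOS] = -math.inf                   # too short: the utterance may not end yet
    for tok in blocked_tokens(tokens, n):
        out[tok] = -math.inf                   # never extend a repeated n-gram
    return out
```

Raising `min_length` pushes every hypothesis toward longer, more detailed responses (the "spicy" end of the spectrum), while the blocking rule removes the repetitive continuations that beam search would otherwise favor.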
In this graph, we show how often human evaluators preferred our chatbots to human-to-human chats over time. Since 2018, we’ve improved model performance in this evaluation—from 23% in 2018 to 49% today.