“Exploration in the Wild”, 2018-12-14:
Making good decisions requires people to appropriately explore their available options and generalize what they have learned. While computational models have successfully explained exploratory behavior in constrained laboratory tasks, it is unclear to what extent these models generalize to complex real-world choice problems.
We investigate the factors guiding exploratory behavior in a dataset consisting of 195,333 customers placing 1,613,967 orders from a large online food delivery service, Deliveroo.
We find important hallmarks of adaptive exploration and generalization, which we analyze using computational models.
We find evidence for several theoretical predictions: (1) customers engage in uncertainty-directed exploration, (2) they adjust their level of exploration to the average restaurant quality in a city, and (3) they use feature-based generalization to guide exploration towards promising restaurants.
Our results provide new evidence that people use sophisticated strategies to explore complex, real-world environments.
…To test algorithms of directed exploration and generalization simultaneously, we compared 3 models of learning and decision-making based on how well they captured the sequential choices of 3,772 new customers who had just started ordering food and who had rated all of their orders.
The first model was a Bayesian Mean Tracker (BMT) that does not generalize across restaurants, only learning about a restaurant’s quality by sampling it. [fixed-effects model?] The second model used Gaussian Process regression to learn about a restaurant’s quality based on the 4 observable features (price, mean rating, delivery time, and number of past ratings). Gaussian Process regression is a powerful model of generalization and has been applied to model how participants learn latent functions to guide their exploration. This model was either paired with a mean-greedy sampling strategy (GP-M) or with a directed exploration strategy that sampled based on an option’s upper confidence bound (GP-UCB). [So no evaluation of Thompson sampling?]
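The contrast between the three models can be sketched in a few lines of code. This is an illustrative reconstruction, not the paper's implementation: the kernel choice, hyperparameters, and function names here are assumptions, but the structure matches the described models, a Kalman-style mean tracker with no generalization, GP regression over the 4 restaurant features, and a UCB rule that reduces to mean-greedy (GP-M) when the exploration bonus is zero:

```python
import numpy as np

def bmt_update(mean, var, reward, noise_var=1.0):
    """Bayesian Mean Tracker: Kalman-style update of one restaurant's
    quality estimate after observing a reward. No generalization --
    a restaurant's estimate changes only when that restaurant is sampled."""
    gain = var / (var + noise_var)            # Kalman gain
    new_mean = mean + gain * (reward - mean)  # shift toward the observation
    new_var = (1.0 - gain) * var              # uncertainty shrinks with sampling
    return new_mean, new_var

def gp_posterior(X_train, y_train, X_test, length_scale=1.0, noise_var=0.1):
    """GP regression over observable features (e.g. price, mean rating,
    delivery time, number of past ratings), RBF kernel (an assumption):
    returns posterior mean and variance for unvisited options too."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * length_scale ** 2))
    K = rbf(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf(X_train, X_test)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ y_train
    var = np.diag(rbf(X_test, X_test) - K_s.T @ K_inv @ K_s)
    return mu, var

def gp_ucb_choice(mu, var, beta=1.0):
    """Directed exploration: pick the option maximizing
    mean + beta * uncertainty; beta = 0 recovers mean-greedy GP-M."""
    return int(np.argmax(mu + beta * np.sqrt(np.maximum(var, 0.0))))
```

With a large enough `beta`, GP-UCB will prefer a high-uncertainty option whose features have never been sampled, whereas GP-M always takes the highest posterior mean; that wedge between the two sampling strategies is what the model comparison exploits.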
We treated customers’ choices as the arms of a bandit and their order ratings as their utility, and then evaluated each model’s performance based on its one-step-ahead prediction error, standardizing performance by comparing to a random baseline. Since it was not possible to observe all restaurants a customer might have considered at the time of an order, we compared the different models based on how much higher in utility they predicted a customer’s final choice compared to an option with average features.
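One common way to operationalize "one-step-ahead prediction error standardized against a random baseline" is a McFadden-style pseudo-R²: score each model's predicted utilities with a softmax choice rule and compare its negative log-likelihood to that of uniform random choice. The paper's exact metric may differ; this is a sketch under that assumption, with hypothetical function names:

```python
import numpy as np

def one_step_ahead_nll(utilities, choices, temperature=1.0):
    """Negative log-likelihood of each observed choice under a softmax
    over the model's predicted utilities for that choice set."""
    nll = 0.0
    for u, c in zip(utilities, choices):
        u = np.asarray(u, dtype=float) / temperature
        log_z = np.log(np.exp(u - u.max()).sum()) + u.max()  # stable log-sum-exp
        nll -= u[c] - log_z
    return nll

def pseudo_r2(utilities, choices, temperature=1.0):
    """Pseudo-R^2 relative to a uniform-random baseline:
    0 = chance-level prediction, approaching 1 = near-perfect prediction."""
    model_nll = one_step_ahead_nll(utilities, choices, temperature)
    random_nll = sum(np.log(len(u)) for u in utilities)
    return 1.0 - model_nll / random_nll
```

On this scale the reported values read directly: the BMT's R² of 0.013 means it barely beats guessing, while GP-UCB's 0.477 means it roughly halves the baseline's predictive loss.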
The BMT model barely performed above chance (R2 = 0.013; 99.9% CI: 0.005–0.022). Although the GP-M model performed better than the BMT model (R2 = 0.231; 99.9% CI: 0.220–0.241), the GP-UCB model achieved by far the best performance (R2 = 0.477; 99.9% CI: 0.465–0.477).
Thus, a sufficiently predictive model of customers’ choices required both a mechanism of generalization (learning how features map onto rewards) and a directed exploration strategy (combining a restaurant’s mean and uncertainty to estimate its decision value).