[code; cf. MuZero version, ReBeL, Perfect Information Monte Carlo (determinization), AlphaZe∗∗] Real-time planning in games with hidden state, using partially observable Monte-Carlo planning (POMCP). This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. [i.e. a posterior-sampling (PSRL) variant of MCTS]
The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in much larger POMDPs than has previously been possible.
We demonstrate its effectiveness in 3 large POMDPs. We scale up a well-known benchmark problem, “rocksample”, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 “battleship” and “partially observable PacMan”, with approximately 10¹⁸ and 10⁵⁶ states respectively.
Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs.
In this paper we extend MCTS to partially observable environments (POMDPs). Full-width planning algorithms, such as value iteration [6], scale poorly for two reasons, sometimes referred to as the “curse of dimensionality” and the “curse of history” [12]. In a problem with n states, value iteration reasons about an n-dimensional belief state. Furthermore, the number of histories that it must evaluate is exponential in the horizon. The basic idea of our approach is to use Monte-Carlo sampling to break both curses, by sampling start states from the belief state, and by sampling histories using a black box simulator.
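[The “curse of history” is easy to see numerically: with |A| actions and |O| observations there are (|A|·|O|)^T distinct action-observation histories at horizon T. The sizes below are hypothetical, purely for illustration:]

```python
# Illustrative sizes only (hypothetical, not from the paper): a
# full-width planner must reason about (|A| * |O|)**T distinct
# histories at horizon T.
num_actions, num_observations = 4, 10
for horizon in (2, 4, 8):
    n_histories = (num_actions * num_observations) ** horizon
    print(f"horizon {horizon:>2}: {n_histories:,} histories")
# horizon  2: 1,600 histories
# horizon  4: 2,560,000 histories
# horizon  8: 6,553,600,000,000 histories
```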
Our search algorithm constructs, online, a search tree of histories. Each node of the search tree estimates the value of a history by Monte-Carlo simulation. For each simulation, the start state is sampled from the current belief state, and state transitions and observations are sampled from a black box simulator. We show that if the belief state is correct, then this simple procedure converges to the optimal policy for any finite horizon POMDP. In practice we can execute hundreds of thousands of simulations per second, which allows us to construct extensive search trees that cover many possible contingencies.
In addition, Monte-Carlo simulation can be used to update the agent’s belief state. As the search tree is constructed, we store the set of sample states encountered by the black box simulator in each node of the search tree. We approximate the belief state by the set of sample states corresponding to the actual history. Our algorithm, Partially Observable Monte-Carlo Planning (POMCP), efficiently uses the same set of Monte-Carlo simulations for both tree search and belief state updates.
Figure 1: An illustration of POMCP in an environment with 2 actions, 2 observations, 50 states, and no intermediate rewards. The agent constructs a search tree from multiple simulations, and evaluates each history by its mean return (left). The agent uses the search tree to select a real action a, and observes a real observation o (middle). The agent then prunes the tree and begins a new search from the updated history [h, a, o] (right).
…Each simulation begins from a start state that is sampled from the belief state B(hₜ). Simulations are performed using the partially observable UCT algorithm, as described above. For every history h encountered during simulation, the belief state B(h) is updated to include the simulation state. When search is complete, the agent selects the action aₜ with greatest value, and receives a real observation oₜ from the world. At this point, the node T(hₜaₜoₜ) becomes the root of the new search tree, and B(hₜaₜoₜ) becomes the agent’s new belief state. The remainder of the tree is pruned, as all other histories are now impossible. The complete POMCP algorithm is described in Algorithm 1 & Figure 1.
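[The search loop described above can be sketched in a few dozen lines. This is a hypothetical minimal reimplementation, not the authors’ code: `step` stands in for the black-box simulator, `pomcp_search` is an invented name, and the particle list stored at each node plays the role of B(h):]

```python
import math
import random

class Node:
    """A node in the tree of histories (minimal hypothetical version)."""
    def __init__(self):
        self.N = 0           # visit count N(h)
        self.V = 0.0         # mean return V(h)
        self.particles = []  # sample states seen here: approximates B(h)
        self.children = {}   # action -> Node, then observation -> Node

def rollout(state, depth, step, actions, gamma, max_depth):
    """Evaluate a leaf by simulating a uniformly random policy."""
    if depth >= max_depth:
        return 0.0
    a = random.choice(actions)
    next_state, _, reward = step(state, a)
    return reward + gamma * rollout(next_state, depth + 1, step,
                                    actions, gamma, max_depth)

def simulate(state, node, depth, step, actions, gamma, c, max_depth):
    """One Monte-Carlo simulation down the tree of histories."""
    if depth >= max_depth:
        return 0.0
    node.particles.append(state)   # belief update as a search side effect
    if not node.children:          # leaf: expand, then evaluate by rollout
        for a in actions:
            node.children[a] = Node()
        return rollout(state, depth, step, actions, gamma, max_depth)
    def ucb(a):                    # PO-UCT action selection
        child = node.children[a]
        if child.N == 0:
            return float("inf")
        return child.V + c * math.sqrt(math.log(node.N) / child.N)
    a = max(actions, key=ucb)
    next_state, obs, reward = step(state, a)   # black-box simulator call
    a_node = node.children[a]
    o_node = a_node.children.setdefault(obs, Node())
    ret = reward + gamma * simulate(next_state, o_node, depth + 1,
                                    step, actions, gamma, c, max_depth)
    node.N += 1
    a_node.N += 1
    a_node.V += (ret - a_node.V) / a_node.N    # incremental mean return
    return ret

def pomcp_search(belief, step, actions, n_sims=2000,
                 gamma=0.9, c=1.0, max_depth=3):
    """Run n_sims simulations from the current belief; return best action."""
    root = Node()
    for _ in range(n_sims):
        state = random.choice(belief)  # sample start state from B(h_t)
        simulate(state, root, 0, step, actions, gamma, c, max_depth)
    return max(actions, key=lambda a: root.children[a].V)

# Toy usage: a hidden coin; guessing its value correctly pays +1.
def step(state, action):
    reward = 1.0 if (action == "guess0") == (state == 0) else 0.0
    return state, 0, reward   # state persists; observation uninformative

belief = [0] * 100            # particle belief concentrated on state 0
print(pomcp_search(belief, step, actions=("guess0", "guess1")))
```

[After acting and observing for real, the agent would reuse the matching observation child as the new root, keeping its particle set as the updated belief; the sketch above omits that pruning step.]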
[That is, to solve a problem harder than problems like Go, where you can see everything, apply posterior sampling: simply randomly sample a possible state of the world, weighted by its prior probability, and then plan your next action with MCTS as if that state were guaranteed to be the case; do this repeatedly, and pick the action with the best average value across all the plans. As in PSRL/Thompson sampling, this balances exploration and exploitation while incorporating uncertainty into the choice of actions.]
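[The flat determinization idea in that summary fits in a few lines. A hypothetical sketch, with invented names: `evaluate(state, action)` stands in for a full planner (e.g. MCTS run from `state` as if fully observed). Note POMCP proper goes further: it samples a fresh state per simulation but shares one search tree across all samples, rather than planning separately per sample:]

```python
import random
from collections import defaultdict

def determinized_plan(sample_state, evaluate, actions, n_samples=200):
    """Score each action by averaging over sampled determinizations.
    evaluate(state, action) plans as if `state` were fully observed."""
    totals = defaultdict(float)
    for _ in range(n_samples):
        s = sample_state()               # draw a world state from the belief
        for a in actions:
            totals[a] += evaluate(s, a)  # plan as if s were certain
    return max(actions, key=lambda a: totals[a])

# Toy usage: the hidden state is 0 with probability 0.9.
sample_state = lambda: 0 if random.random() < 0.9 else 1
evaluate = lambda s, a: 1.0 if a == s else 0.0  # stand-in for a real planner
print(determinized_plan(sample_state, evaluate, actions=(0, 1)))  # 0
```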