
[–]parzivalml 15 points16 points  (0 children)

Some good points by Alex Irpan here: https://www.alexirpan.com/2018/11/27/go-explore.html

[–]NubFromNubZulund 25 points26 points  (1 child)

The definition of a “cell” corresponding to a downsampled 11x8 screen looks like a very domain specific hack to me. Sure, it’ll work in Montezuma where the majority of change in pixel intensity is driven by Panama Joe moving around (i.e. it just becomes a proxy for his world position) but how would it go in 3d games for example?

[–]joosthuizinga 20 points21 points  (0 children)

Poorly, in all likelihood. The method of downsampling the screen does not require domain knowledge in the sense that you can apply it to all Atari games, but it probably won't work well on all Atari games. Honestly, we were quite surprised by the fact that the downsampled screen worked as well as it did on Montezuma's Revenge, as we expected that a more sophisticated state representation would have been required. Also, to add to your point, the downsampled screen did not work well on Pitfall, which is still simpler than a 3d game.

However, there has been a vast amount of literature on better, domain-agnostic state representations, and we are eager to try some of them in combination with Go-Explore to see if we can apply the algorithm to a larger variety of domains.

[–]darkconfidantislife 35 points36 points  (13 children)

Okay wait wtf, they're not using sticky actions?

EDIT: They updated and it still works well with sticky actions, see below

[–]parzivalml 19 points20 points  (0 children)

It seems they are using a deterministic environment: https://twitter.com/sherjilozair/status/1067154816936235009.

[–]jclune 2 points3 points  (1 child)

2nd Update: Go-Explore when robustified with sticky actions on Montezuma’s Revenge scores an average of 281,264 (level 18) with domain knowledge (33,836 without). On Pitfall, the average score with domain knowledge is 20,527 with a max of 64,616 (!) All SOTA. Blog updated. https://eng.uber.com/go-explore/

[–]darkconfidantislife 2 points3 points  (0 children)

Nice, thanks for the update! Go-Explore has a lot of similarities with Fractal Monte Carlo; I think it would be a good citation!

[–]AdrienLE1 2 points3 points  (0 children)

We updated our blog post to discuss this issue: http://eng.uber.com/go-explore/ (see the update at the bottom).

TL;DR: for testing with sticky actions, just testing our networks (which were not robustified with them) already puts us significantly above the state of the art. We are working on robustifying with sticky actions right now and expect our testing performance with sticky actions to get even better when that's done. We don't (yet) train with sticky actions the whole time; however, we believe that many RL applications in the foreseeable future will make use of simulators, in which case it is OK to exploit properties of simulators, such as their determinism, at some point during training, as long as the final network is robust to stochasticity, as in our own case.

[–]joosthuizinga 5 points6 points  (8 children)

There has been a lot of work on Atari that has not been tested with sticky actions, so it is not that uncommon. That said, we have no reason to believe we would not be able to robustify a policy that works with sticky actions, we just need to add it to the imitation learning algorithm.

[–]deepML_reader 17 points18 points  (2 children)

It's not about whether they can robustify the policy with sticky actions, it's about whether they could have trained the policy using sticky actions from the start.

[–]joosthuizinga 5 points6 points  (1 child)

While I am optimistic, whether or not we can deal with stochasticity, including sticky actions, at training time is a separate question that we will hopefully be able to answer in the future. For now, we show that we can make progress on really difficult problems provided that a deterministic environment is available at training time, something that has not been shown by anyone else.

[–]parzivalml 11 points12 points  (4 children)

Those works don't exploit determinism like you do. Sticky actions and stochastic frame skips were introduced specifically to prevent algorithms that exploit the deterministic nature of ALE.
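For readers unfamiliar with the protocol: "sticky actions" means the emulator repeats the agent's previous action with some probability (commonly 0.25) instead of the newly chosen one, so memorized action sequences stop replaying exactly. A minimal wrapper sketch, assuming a generic gym-style `env` object; the class and names here are illustrative, not the ALE implementation:

```python
import random

class StickyActionWrapper:
    """With probability p, execute the previously taken action instead of
    the one the agent just chose (the 'sticky actions' protocol)."""

    def __init__(self, env, p=0.25):
        self.env = env          # any object with reset()/step(action)
        self.p = p
        self.last_action = 0    # start episodes with NOOP

    def reset(self):
        self.last_action = 0
        return self.env.reset()

    def step(self, action):
        if random.random() < self.p:
            action = self.last_action  # ignore the agent's choice this frame
        self.last_action = action
        return self.env.step(action)
```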

[–]joosthuizinga 13 points14 points  (3 children)

No-ops, sticky actions, and stochastic frame skips were added to prevent DNN policies from overfitting to a deterministic environment, and they were aimed at making these policies more robust.

We also aim to train robust policies, but we explicitly separate the problems of exploration and robustification by exploring in a deterministic environment and only adding stochasticity later.

This assumes that a deterministic environment is available at training time, and this is an assumption that is not made in most previous work. However, we believe that it is an assumption that holds for many practical problems, namely all problems for which a simulator is available, meaning the algorithm has interesting practical uses.

That said, working directly on stochastic environments by using a policy to return to previously visited states is a problem that we are hoping to tackle in future work.

[–]parzivalml 5 points6 points  (0 children)

Thanks, that's a reasonable argument.

[–]Fragore 0 points1 point  (1 child)

> Poorly, in all likelihood. The method of downsampling the screen does not require domain knowledge in the sense that you can apply it to all Atari games, but it probably won't work well on all Atari games. Honestly, we were quite surprised by the fact that the downsampled screen worked as well as it did on Montezuma's Revenge, as we expected that a more sophisticated state representation would have been required. Also, to add to your point, the downsampled screen did not work well on Pitfall, which is still simpler than a 3d game.

So, at testing time, was a stochastic (with sticky actions) environment used, or was it still deterministic? Because it does not seems so from the post. So in case, as also pointed out here, you are comparing with algorithms that use it. This would mean that you are benchmarking against algorithms that solve a totally different problem. Is it right? Or am I missing something?

[–]joosthuizinga 0 points1 point  (0 children)

See my reply on your other comment/question.

[–]MrDoOO 65 points66 points  (32 children)

The algorithm is described as:

  1. Choose a cell from the archive probabilistically (optionally prefer promising ones, e.g. newer cells)
  2. Go back to that cell
  3. Explore from that cell (e.g. randomly for n steps)
  4. For all cells visited (including new cells), if the new trajectory is better (e.g. higher score), swap it in as the trajectory to reach that cell
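A minimal sketch of the loop in steps 1-4, assuming a gym-style environment plus hypothetical `get_state`/`set_state` methods for saving and restoring the emulator, and a `cell_fn` mapping an observation to a cell; selection weighting and the rest of Go-Explore's bookkeeping are omitted:

```python
import random

def explore_phase(env, cell_fn, iterations=1000, n_steps=100):
    """Archive maps cell -> (saved emulator state, trajectory, score)."""
    obs = env.reset()
    archive = {cell_fn(obs): (env.get_state(), [], 0.0)}

    for _ in range(iterations):
        # 1. Choose a cell from the archive (uniform here; Go-Explore weights
        #    selection toward promising cells, e.g. rarely visited ones).
        cell = random.choice(list(archive))
        state, trajectory, score = archive[cell]

        # 2. Go back to that cell by restoring the saved emulator state.
        env.set_state(state)
        traj, total = list(trajectory), score

        # 3. Explore from it, e.g. with random actions for n_steps.
        for _ in range(n_steps):
            action = env.action_space.sample()
            obs, reward, done, _ = env.step(action)
            traj.append(action)
            total += reward

            # 4. For every cell visited, keep the best trajectory found so far.
            c = cell_fn(obs)
            if c not in archive or total > archive[c][2]:
                archive[c] = (env.get_state(), list(traj), total)
            if done:
                break
    return archive
```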

This is exhaustive search. Also, "Returning to cells" is not trivial by any stretch in environments with any amount of stochasticity... The authors say that in stochastic environments "one can train a goal-conditioned policy [1, 10] that learns to reliably return to a cell". It seems totally intractable to train and maintain separate neural nets for each cell that you want to ever "return to". Am I missing something, or is this algorithm akin to an exhaustive search with the additional overhead of insane memory requirements and training time?

Yet another silly RL paper that introduces an extremely brittle and domain specific algorithm to be able to say look we "win" on something. The RL community has got to be better than this if we want RL to work in practice.

[–]gwern 14 points15 points  (18 children)

Does exhaustive search do well on MR? I don't recall anyone showing that, and I would expect that to be intractable.

> It seems totally intractable to train and maintain separate neural nets for each cell that you want to ever "return to"

It would be goal-conditioned, so presumably only one NN is trained, but is parameterized to take in the 'cell' state representation and is rewarded only when it reaches that.

Still, it's pretty crazy that all they have to do to define 'cells' is... a tabular representation of downscaled images:

> To be tractable in high-dimensional state spaces like Atari, Go-Explore needs a lower-dimensional cell representation with which to form its archive. Thus, the cell representation should conflate states that are similar enough to not be worth exploring separately (and not conflate states that are meaningfully different). Importantly, we show that game-specific domain knowledge is not required to create such a representation. We found that what is perhaps the most naive possible cell representation worked pretty well: simply downsampling the current game frame... In particular, we tested a domain-knowledge version of Go-Explore on Montezuma’s Revenge, wherein cells were defined as unique combinations of the x-y position of the agent, the current room, the current level, and the current number of keys held. We wrote simple code to extract this information directly from pixels...
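For concreteness, a sketch of the kind of downsampled-frame cell representation quoted above; the 11x8 grid and 8 intensity levels match the numbers discussed in this thread, but treat them as illustrative parameters rather than the exact pipeline:

```python
import numpy as np

def frame_to_cell(frame, width=11, height=8, n_intensities=8):
    """Map a grayscale Atari frame (e.g. 210x160) to a small hashable tuple;
    frames that map to the same tuple are treated as the same 'cell'."""
    h, w = frame.shape
    # Crop so the frame divides evenly, then average-pool down to height x width.
    small = frame[: (h // height) * height, : (w // width) * width]
    small = small.reshape(height, h // height, width, w // width).mean(axis=(1, 3))
    # Quantize pixel intensities into n_intensities buckets.
    quantized = (small / 256.0 * n_intensities).astype(np.uint8)
    return tuple(quantized.flatten())
```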

> Robustifying the trajectories found with the domain knowledge version of Go-Explore produces deep neural network policies that reliably solve the first 3 levels of Montezuma’s Revenge (and are robust to random numbers of initial no-ops). Because in this game all levels beyond level 3 are nearly identical (as described above), Go-Explore has solved the entire game! In fact, our agents generalize beyond their initial trajectories, on average solving 29 levels and achieving a score of 469,209!...Go-Explore’s max score is substantially higher than the human world record of 1,219,200, achieving even the strictest definition of “superhuman performance.”

The resource usage will be very interesting. EDIT:

> All runs were done on single machines each with 22 CPUs and 50GB of RAM

Hah, that's less RAM than I have on this machine. Not bad.

[–]svantana 1 point2 points  (0 children)

> All runs were done on single machines each with 22 CPUs and 50GB of RAM

> Hah, that's less RAM than I have on this machine. Not bad.

Well, given that the entire game fits in a 6 kb file [1], it's not that parsimonious...

[1] http://www.atarimania.com/game-atari-2600-vcs-montezuma-s-revenge_7986.html

[–]MrDoOO 1 point2 points  (16 children)

What separates this from exhaustive search? Run a random agent, visit some states, sample promising trajectories, move randomly from those, repeat. Also, being able to reset the state of the simulator to a specific frame is a total cheat that makes the benchmark lose significance entirely...

[–]gwern 7 points8 points  (14 children)

Any exploration strategy sounds worthless and like 'exhaustive search' when you phrase it like that. How do you decide what 'states' to pick? How do you sample promising trajectories? How do you learn from your random moves? It's not exhaustive search because if you try to brute force every possible trajectory in MR, it's probably not going to end well, for the same reason that epsilon-greedy doesn't work.

And it's only a cheat if they can't actually train a NN to reliably reach the goal; if they can, it's just an optimization, as they say, and is no more interesting than, say, OA5 using the API rather than learning from pixels.

[–]deepML_reader 8 points9 points  (6 children)

They have claimed SOTA on this task and so the burden is on them to show that they can actually do it using the same restrictions everyone else uses, rather than us giving them the benefit of the doubt. Having a network that can achieve any specified state in montezuma or pitfall would be amazing and non-trivial.

Whether or not they succeed at making this goal-achieving agent in the future doesn't change the fact that their strong claim is wrong right now, what they have right now is a belief that they can produce a SOTA model.

[–]joosthuizinga 9 points10 points  (5 children)

It is common to have different variants of a particular problem where different assumptions are made. We claim SOTA under the assumption that a deterministic version of the environment is available. We compare against others who do not make this assumption because they hold the current state-of-the-art regardless of whether a deterministic version of the environment is available. To my knowledge, there are currently no other algorithms that obtain higher scores than those reported by us by assuming that a deterministic version of the environment is available during training time.

[–]Fragore 1 point2 points  (3 children)

> It is common to have different variants of a particular problem where different assumptions are made. We claim SOTA under the assumption that a deterministic version of the environment is available. We compare against others who do not make this assumption because they hold the current state-of-the-art regardless of whether a deterministic version of the environment is available. To my knowledge, there are currently no other algorithms that obtain higher scores than those reported by us by assuming that a deterministic version of the environment is available during training time.

But this means that you are comparing against algorithms that solve a completely different problem (fully stochastic vs. fully deterministic). The benchmarking does not make much sense under these assumptions.

[–]joosthuizinga 4 points5 points  (1 child)

There are two issues, and I will address them separately.

The first issue is that we use a random number of no-ops at the start, rather than sticky actions, to evaluate the robustness of our policies after training (which includes both the exploration phase and the robustification phase). Using no-ops to evaluate robustness has been the standard for a long time, which is why we chose it, but we agree that sticky actions are a better test for policy robustness, and we are currently retraining our policies with sticky actions to make the comparison more fair. Thus far, we have tested our policies with sticky actions (but without further training), and we obtain a score of 19,540 (Montezuma's Revenge without domain knowledge), down from 35,410.

The second issue is that we assume access to a deterministic version of the world at training time. The general idea is that the Atari games are a proxy for difficult, real-world problems like robot control, and the deterministic version of the game represents a simulator of that real-world problem. We believe that such a simulator is available for many real-world problems, but we also acknowledge that there are plenty of problems where such a simulator is not readily available.

We show results on the version of the problem where a simulator is available, but we believe we have to compare against previous state-of-the-art that trains directly in a stochastic environment, because a technique that can be applied directly in a stochastic environment can obviously also be applied in cases where a deterministic simulator is available.

With that said, we have not been very clear in explaining either of these points in the blog post, and we are working on clarifying them.

[–]Fragore 1 point2 points  (0 children)

Thanks for the clarification!

[–]jclune 0 points1 point  (0 children)

2nd Update: Go-Explore when robustified with sticky actions on Montezuma’s Revenge scores an average of 281,264 (level 18) with domain knowledge (33,836 without). On Pitfall, the average score with domain knowledge is 20,527 with a max of 64,616 (!) All SOTA. Blog updated. https://eng.uber.com/go-explore/

[–]MrDoOO 9 points10 points  (6 children)

This work literally solves an easy version of the problem with exhaustive search and then uses imitation learning to match the policy. The difficulty in RL is learning the first policy to begin with. Hacking a simulator to make the problem easy is just a cheat. What this paper shows is that imitation learning works...which we already knew.

[–]gwern 10 points11 points  (4 children)

> This work literally solves an easy version of the problem with exhaustive search and then uses imitation learning to match the policy.

It's not exhaustive search if they don't 'exhaust' the search space. Which they probably don't. And you haven't given any examples of someone solving this 'easy version' of MR before and I am skeptical that MR has such a small game tree it can be brute-forced in the way that you claim it can. Does even MCTS on the emulator RAM work? I don't recall Guo getting any MR results worth remembering... (And in what sense are DRL agents not always engaged in 'imitation learning' of successful trajectories, one wonders.)

Again, if you can in fact train a NN to reliably reach a specified goal, why is this approach of learning interesting points, using an option to travel to it and resuming exploration, then learning from gathered experiences, any less legitimate than, say, Neural Episodic Control?

[–]MrDoOO 9 points10 points  (3 children)

[–]gwern 12 points13 points  (2 children)

They don't explain it very well, nor do they back up your claims. Again, if you can train a NN to reach specified states, why is this illegitimate? If it's so easy to solve with basic planning, why has no one done it? Why don't MCTS and other planning approaches work? Are world models/deep environment models illegitimate and useless for planning if they aren't inherently stochastic? These are not hard questions for you to answer.

[–]sherjilozair 8 points9 points  (0 children)

> Again, if you can train a NN to reach specified states, why is this illegitimate?

Because this is not easy to do, in terms of sample efficiency. In this context, the states we care to reach are novel states, which by definition we only have a few (probably one) examples of. Goal-conditional policies would require more than that to reliably reach the goal. The data requirements to train such a goal-conditional policy could blow up, and the burden of proof that it does not rests on the authors.

MCTS is a reward-dependent planning scheme, and also breadth-first in spirit, and thus quite unsurprisingly didn't work for MZR. I do believe depth-first search with some heuristics should be able to score as much. I am planning to try this in the near future.

EDIT: I realize that go-explore essentially is depth-first search with heuristics (state abstraction via downsampling).

[–]thebackpropaganda 1 point2 points  (0 children)

> Are world models/deep environment models illegitimate and useless for planning if they aren't inherently stochastic?

Yes, unless the environment is nearly-deterministic.

This is why researchers are working on stochastic world models, which do indeed work much better than deterministic models; see e.g. CGQN and the citations within.

[–]joosthuizinga 9 points10 points  (0 children)

Yes, we definitely solve an easier version of the problem first, and then use imitation learning to create a policy; this is one important insight that we want to put forward in this blog post. And while it may seem that making the simulator deterministic is a cheat, it is important to note that this is a cheat which is available in almost every single simulator (long live pseudo-random number generators) meaning it is practically applicable. In addition, to the best of my knowledge, nobody has managed to solve the easy version of the problem.

As for being exhaustive search, Go-Explore definitely resembles a form of exhaustive search. However, an exhaustive search in the original state-space (i.e. every single combination of the player, the hazards, and the items at different locations in the game) is intractable. As such, one primary insight is to conflate the search space by mapping many different states to the same cell.

Now you get a different problem: it becomes unclear which actions to take to get from one part of the reduced search space to the other. In our current implementation, it turns out that just taking random actions is usually sufficient to discover nearby cells, though it is expected that other methods for state conflation will require more intelligent exploration strategies to get from one cell to the next.

[–]ragamufin 0 points1 point  (0 children)

It's not a cheat; those are transient states. The point of the model is to solve the problem efficiently, not to accurately simulate actions in a video game.

[–]AdrienLE1 30 points31 points  (8 children)

> Am I missing something, or is this algorithm akin to an exhaustive search with the additional overhead of insane memory requirements and training time?

I gave details for training time and memory requirements in these tweets: https://twitter.com/AdrienLE/status/1067176312945631233. I think they are far from insane (in fact they are much lower than those of most other RL algorithms).

"Exhaustive search" is hard to define, but as far as we know, no planning algorithm working directly in the emulator (in fact, no algorithm ever) had reached anywhere near these scores on Montezuma's. To quote another comment I gave here: "The original ALE paper (https://arxiv.org/abs/1207.4708) tried a few planning algorithms, all of which got a flat 0 on Montezuma's."

> Yet another silly RL paper that introduces an extremely brittle and domain specific algorithm to be able to say look we "win" on something

We do generate brittle trajectories during training, but then robustify them using imitation learning and obtain robust policies.
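For illustration, the simplest possible version of that robustification step is behavioral cloning on (observation, action) pairs from the best trajectories. The sketch below is that stand-in, not the learning-from-demonstration algorithm the blog post actually describes, and the shapes and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

def robustify_by_cloning(trajectories, obs_dim, n_actions, epochs=10, lr=1e-3):
    """Behavioral cloning: fit a policy to (observation, action) pairs taken
    from the exploration-phase trajectories. Each trajectory is assumed to be
    a list of (obs_vector, action) pairs."""
    policy = nn.Sequential(
        nn.Linear(obs_dim, 256), nn.ReLU(),
        nn.Linear(256, n_actions),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    obs = torch.tensor([o for traj in trajectories for o, _ in traj],
                       dtype=torch.float32)
    acts = torch.tensor([a for traj in trajectories for _, a in traj],
                        dtype=torch.long)

    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(obs), acts)  # maximize log-prob of demo actions
        loss.backward()
        opt.step()
    return policy
```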

The only real requirement of the current version, which indeed relies on determinism for phase 1 and only becomes robust in phase 2, is to train in a simulator, since any computer-based simulator can be made deterministic by setting its random seed. This means you can't use this version of the algorithm if you want to learn 100% in the real world, but then again learning in the real world is also often impractical with conventional RL algorithms due to their huge sample complexity, which is why most people use simulators.

[–]guicho271828 18 points19 points  (3 children)

Optimization & graph search expert here. This approach is Frontier Search (Korf 1999 IJCAI) or RBFS (Korf 1993, Artificial Intelligence) or somewhere in the middle, combined with RL.

It is wrong to say it can always go back to the cell, even in a deterministic setting. Consider a non-replenishable resource like fuel. At some point it becomes impossible to go back to a certain cell due to the environmental constraint.

Edit: So, about "Exhaustive search", the more appropriate, established term for it is "systematic search". Since the algorithm actually presented is stochastic and randomized, as well as seemingly having only a fixed amount of cell memory, it does not guarantee completeness and is not fully systematic. Completeness = the ability to find a solution in finite time whenever it is reachable, under a deterministic setting -- in a nondeterministic setting, this is replaced by probabilistic completeness.

[–]rhaps0dy4 14 points15 points  (0 children)

Montezuma’s Revenge has always been easier with search. Resetting the environment is very powerful. In 2015, just before DQN came around, Lipovetzky, Ramírez and Geffner had already shown a score of 540 using their domain-independent Iterated Width algorithm.

For my bachelor thesis in 2016, by cheating, I greatly improved this to 14,900 using a few domain-specific heuristics. Namely, don't explore the same position in the same screen twice (except if you're stuck dying continuously, to allow the algorithm to wait for obstacles that are only passable in intervals), and penalise opening some of the doors (the algorithm was too myopic and wasted keys on them). This run reached the maximum score possible without finishing the first level. Here's a video.

[–]yazriel0 1 point2 points  (1 child)

Can you recommend a modern survey (or practical guide!) for these techniques?

My problem domain is really just a weird and specific find-optimal-sub-graphs type search. I got caught up in the AlphaGo hype and now we have MCTS + neural net heuristic + plenty of CPU + manual features. It sort of works. But I never really tested more traditional algorithms.

[–]NubFromNubZulund 12 points13 points  (2 children)

Actually, a second BIG requirement is that the stochasticity in the environment is relatively minor (as it is with sticky actions). Imagine you’re training a poker bot to play heads up against a computer opponent. The cards dealt to each player and the cards that come down on the flop, etc are determined by the random seed. If you try your approach on this game, it will just memorise which hands are winning; it won’t actually learn how to beat the computer with proper poker skills.

[–]joosthuizinga 5 points6 points  (1 child)

You are right, if there are certain states which can only be reached with some random seeds, but not others, Go-Explore may find trajectories to those states, but an imitation learning algorithm will not be able to reliably revisit those states because the environment itself prevents you from visiting those states reliably.

One possible strategy for dealing with environments like that is to run the exploration phase multiple times with different seeds. This way, you may be able to gather enough trajectories such that you have good examples of what to do in very different scenarios.

That said, I don't think poker is the best domain for an algorithm like Go-Explore.

[–]NichG 2 points3 points  (0 children)

You could just add a filter such that states which prove difficult to revisit via imitation are removed from the archive.

This should also partially resolve training in inherently stochastic environments as well, since you could basically explore around the 'reliably repeatable' subpart of the problem space.
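A sketch of that filtering idea, assuming each archive entry carries hypothetical counters for how often the return policy attempted to reach the cell and how often it succeeded:

```python
def prune_unreliable_cells(archive, min_success_rate=0.5, min_attempts=5):
    """Drop archived cells that the return policy keeps failing to reach.
    `entry.attempts` / `entry.successes` are hypothetical bookkeeping fields."""
    return {
        cell: entry
        for cell, entry in archive.items()
        if entry.attempts < min_attempts  # not enough evidence yet, keep it
        or entry.successes / entry.attempts >= min_success_rate
    }
```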

[–]jedi-son 0 points1 point  (0 children)

> You are right, if there are certain states which can only be reached with some random seeds, but not others, Go-Explore may find trajectories to those states, but an imitation learning algorithm will not be able to reliably revisit those states because the environment itself prevents you from visiting those states reliably.

Similar question to the others but let me word it more specifically: Does this algorithm maintain performance on never before seen levels (ie out of sample)?

[–]Antonenanenas 5 points6 points  (3 children)

I think you are missing something: you do not need separate neural nets for each cell that you want to return to. Using Hindsight Experience Replay (HER) you can train one policy that can be used for any goal. You simply feed the goal as an additional input to the network and apply some training tricks to make it work.
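A minimal sketch of that goal-conditioned setup (UVFA/HER style): one network, with the target cell representation concatenated onto the observation. Dimensions and architecture here are placeholders:

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """One policy for every goal: the target cell is just an extra input."""

    def __init__(self, obs_dim, goal_dim, n_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, goal):
        # Condition on the goal by concatenation; during training, reward is
        # given only when the state matching `goal` is actually reached.
        return self.net(torch.cat([obs, goal], dim=-1))
```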

The memory requirements are not necessarily "insane", as they downsample the input image drastically. But I agree, the state space might still be large and it is bad that we do not have any info on how time and memory efficient this algorithm is.

This algorithm might be a bit brittle and domain specific as of now, but don't you think that this can still count as a proof of concept to build future work on?

[–]seann999 2 points3 points  (0 children)

Also (earlier work): UVFAs

[–]AdrienLE1 1 point2 points  (0 children)

Thanks, I forgot to address this. Yes, this is absolutely true: this would likely use a single goal-conditioned neural net policy that takes in the target cell as input, not a plethora of neural nets.

Indeed, there is a lot of prior work on this with HER and UVFAs.

[–][deleted] 25 points26 points  (6 children)

It is nonsensical to say *we will add noise later* in an RL problem

[–]seann999 13 points14 points  (3 children)

Here's a similar thread https://twitter.com/tejasdkulkarni/status/1067136994344620033

I do smell something kinda fishy overall, but we're gonna need a paper with more details than a blog post.

[–]MrDoOO 15 points16 points  (2 children)

"Here we realized that [stochasticity] hurts exploration, so we can first solve a deterministic version THEN robustify. That's much better" - Author

I don't think the author realizes that you can't turn off stochasticity on real domains. Color me shocked that solving hard domains like MR becomes easy when you turn off stochasticity...

[–]AdrienLE1 24 points25 points  (0 children)

In robotics (which I would argue is a real domain), it is common to do the first phase of training in a simulator, where stochasticity can be turned off. Of course, you eventually need to become robust to stochasticity, but the second phase of Go-Explore does this successfully.

[–]probablyuntrue 12 points13 points  (0 children)

> if we ignore one of the chief problems of the field we can beat SOTA

congrats to the authors I guess lol

[–]joosthuizinga 7 points8 points  (1 child)

Maybe when you consider the space of all possible RL problems.

However, when it comes to RL applications like robotics or video games, we almost always have some kind of computational simulation, and this computational simulation can be made deterministic by choosing one particular random seed first, and noise can be added later by choosing different random seeds. The resulting two phase problem may no longer be an RL problem in the strictest sense of the word, but it is a problem that has lots of practical applications if you can solve it effectively.
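As a toy illustration of that two-phase use of seeds (`make_env`, `explore`, and `train_on` are hypothetical stand-ins, not a specific simulator API):

```python
def exploration_phase(make_env, explore):
    # Phase 1: a single fixed seed makes the simulator deterministic, so
    # stored action sequences reliably reproduce the same states.
    env = make_env(seed=0)
    return explore(env)

def robustification_phase(make_env, train_on, trajectories, n_seeds=100):
    # Phase 2: reintroduce noise by training/evaluating across many seeds
    # (and/or sticky actions, random no-ops, etc.).
    envs = [make_env(seed=s) for s in range(1, n_seeds + 1)]
    return train_on(envs, trajectories)
```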

[–][deleted] 8 points9 points  (0 children)

This is a simulator trick and not a progression of RL state of the art

[–]outlacedev 5 points6 points  (1 child)

It looks like only the "domain knowledge" versions of the algorithm are significantly better than SOTA. Domain knowledge meaning they hard-coded a script to extract features from the states. It seems by manually extracting the features they've sort of turned Montezuma's Revenge into Gridworld.

[–]dod_worker 1 point2 points  (0 children)

The authors are basically using "domain knowledge" as a euphemism for "hand-crafted feature extraction", which is potentially their biggest issue. The blog post would not be nearly as interesting if the authors just stated bluntly: "we used hand-crafted features for MR and got better performance".

Anytime you use hand-crafted features, you are of course going to get much better performance.

[–]michael-relleum 3 points4 points  (16 children)

Interesting. I've seen Fractal AI around for quite some time, but always understood it as "just" a very sophisticated Monte Carlo search. How does this compare with the new RND from OpenAI? Which one trains faster / gets a higher score?

Edit: Never mind, this has nothing to do with Fractal AI. Still, my question stands: is this faster / does it get higher scores than Random Network Distillation? (I mean for environments other than Montezuma and Pitfall, for example Pac-Man or Breakout.)

[–]yazriel0 1 point2 points  (1 child)

They seem to use (and cite) an RND network in the "imitation learning" phase of the algorithm.

But I agree that we want to compare the "exploration" phase to existing MCTS variants.

[–]AdrienLE1 5 points6 points  (0 children)

We aren't aware of any prior planning algorithm that gets anywhere near to Go-Explore's scores, even when planning in the emulator. The original ALE paper (https://arxiv.org/abs/1207.4708) tried a few planning algorithms, all of which got a flat 0 on Montezuma's.

[–]Antonenanenas 1 point2 points  (1 child)

Yes, it gets higher scores; it says so in the blog. RND gets about 15k points, whereas this approach without domain-specific knowledge gets about 36k. Look at the figure where the x-axis is the year and the y-axis is the score.

[–]michael-relleum 5 points6 points  (0 children)

Yes, for Montezuma, but how well does this work for other environments? Is this a general advance in RL (like PPO was), or is it just exhaustive search, made to measure for Montezuma and Pitfall?

[–]AdrienLE1 3 points4 points  (1 child)

Random Network Distillation is the "RND" point in our comparisons to SOTA (see the graphs in the blog post). The answer is that RND gets around 11,500 points on average, whereas our non-domain knowledge version gets over 35,000 and our domain knowledge version gets over 450,000, so yes, this way outperforms it.

As far as we know, Go-Explore beats all current RL algorithms on Montezuma's Revenge.

[–]NubFromNubZulund 1 point2 points  (0 children)

How many emulator steps does it require though if you don’t use state loading?

These latest papers on “sota” results in Montezuma’s Revenge both make very little mention of sample complexity. The top RND score used 16B frames of experience (100x more than previous sota agents) and I’m guessing yours takes even more if you actually took the steps required to return to a state into account.

Surely the main aim of exploration methods is sample efficiency!

[–]darkconfidantislife 0 points1 point  (9 children)

To clarify, if you look at what "fractal monte carlo" does, it's quite similar to this. I'm not saying they're related, sorry for the confusion

For example, choose some state with random exploration and then clone to it if it is better, done in both this and "fractal monte carlo"

[–]michael-relleum 0 points1 point  (8 children)

Oh, I know. But wasn't the "problem" with FAI that it didn't really learn anything per se, it just tried out hundreds of variations each time and chose the best one? Meaning it worked fast the first time but had to do the calculation again and again on every run? Is this different for Go-Explore?

[–]darkconfidantislife 2 points3 points  (7 children)

Is Go-Explore actually doing any learning?

EDIT: actually looks like they are with using deep nets for imitation learning of the generated trajectory paths. So basically a lot like FAI + imitation learning plus some domain knowledge for PR purposes for their 2M score version

[–][deleted]  (6 children)

[deleted]

[–]tihokan 2 points3 points  (4 children)

> we are not planning to publish that work so we can avoid more "funny coincidences" like this one

A more positive attitude would be to publish your work to inspire others, without worrying too much over people potentially re-using your ideas without proper credit. Most ideas keep being re-discovered independently anyway, so holding on to them is usually the best way to get scooped.

[–][deleted]  (3 children)

[deleted]

[–]tihokan 2 points3 points  (2 children)

I actually read about Fractal AI first through the Fractal AI recipe blog post. Unfortunately I found it rather unclear and it basically sounded to me like some variant of Monte-Carlo sampling requiring the ability to easily reset the state, with no learning ability. My feeling was that it might have some potential but I’d wait to see if anything convincing enough came out of it before trying to really understand it.

I honestly can’t tell whether this stuff is as exciting as you make it sound, but there’s one thing I know for sure: if it is, then you failed to communicate it properly to the RL research community, as otherwise it would be a lot more popular. It sounds like you’ve already got some peer reviews and are familiar with the relevant RL literature, so I guess you should have some ideas on how to better present it, but in case you want some extra advice feel free to send me your latest version (I’m used to reviewing RL papers).

And btw thanks for sharing that doc, keeping track of Atari results has always been a pita :/

[–]miau-db 2 points3 points  (1 child)

You are right, and that kind of healthy skepticism is needed to survive the huge amount of literature published about RL lately.

FAI is nice, but it is only a tool. I actually think that what is extraordinary, and has the potential to change how people train RL agents, is the approach of combining planning and DL (how the planning is performed is just a minor detail).

The Go-Explore algorithm is amazing enough on its own (as you can check, they crush all the literature we found). Trying to overhype it will just end up making people think that Go-Explore is another "Microsoft's Pacman", when they are in fact offering a different approach to training DNNs that solve decision problems.

Sadly, the hype is something that may influence how seriously people take this kind of work. I would love to see more researchers working on the topic, and the Uber guys have something that can be "convincing enough" to make people care about "planning + DL". I just hope that they make the effort to do science right, so people like you feel motivated to start "hacking the environment" with planning algorithms.

[–]tihokan 0 points1 point  (0 children)

Oh I'm all for combining planning + DL for RL, and I doubt there are many skeptics left regarding the potential of such a combination, after seeing AlphaGo/Zero results. I also think there's a healthy amount of people working on model-based RL, though progress has been slow due to how hard it is to build reliable models of the environment. And I very much like the idea of leveraging the specific properties of simulators (vs. the real world) to improve on these, as long as the resulting limitations are clearly stated and there are demonstrated benefits on practical applications (regarding Uber's work, I'm waiting for the paper to be available to judge whether this is the case).

[–]darkconfidantislife 1 point2 points  (0 children)

Hey, Fractal AI looks really cool, would love to talk, PM me?

[–]darkconfidantislife 2 points3 points  (0 children)

By the way, this seems to have a lot of similarities to "fractal monte carlo": https://arxiv.org/pdf/1807.01081.pdf

http://entropicai.blogspot.com/2018/03/fractal-ai-recipe.html


[–]question99 2 points3 points  (0 children)

Emphasis added by me:

> Robustifying the trajectories found with the domain knowledge version of Go-Explore produces deep neural network policies that reliably solve the first 3 levels of Montezuma’s Revenge (and are robust to random numbers of initial no-ops). Because in this game all levels beyond level 3 are nearly identical (as described above), Go-Explore has solved the entire game!

Now the question is: could this technique be improved to automatically infer what was encoded as domain knowledge?

[–]unguided_deepness 2 points3 points  (0 children)

The finding seems pretty obvious to me. When you play/speedrun a difficult game, you save right before a difficult sequence, and if you fail, you restart from that save state. Once you master all the difficult sequences, you try to play the game from the beginning again and try to optimize it.

[–]kaledivergence 2 points3 points  (0 children)

Uber just released a note on the issue of stochasticity at the bottom of the post.

[–]yazriel0 2 points3 points  (2 children)

Setting aside the concerns about the generality and reset-to-state aspects:

Is this algorithm a major improvement with respect to CPU resources? There is no back-prop, so no GPU, and some good results with just 10^8 frames on a hard-exploration Atari game.

[–]AdrienLE1 5 points6 points  (0 children)

I gave a bit of data about resources in these tweets: https://twitter.com/AdrienLE/status/1067176312945631233, more data is coming in the paper.

Paradoxically, the more computationally intensive part currently is the imitation learning in phase 2, but we think this is largely due to the algorithm we happened to choose (in principle, any imitation learning algorithm that can learn from trajectories would be usable for phase 2).

[–]deepML_reader 0 points1 point  (0 children)

Does the 10^8 number take into account the number of actions you have to replay to get back to the same state?

[–]rlstudent 2 points3 points  (11 children)

I really liked the idea.

I don't think it's a "hack" or anything like that. It works on any deterministic (or "deterministicfiable") environment.

I should read the AlphaGo paper, but from what I remember reading about it, didn't they use a similar approach? Phase 1 looks like Monte Carlo Tree Search to me, is it different? In this one you can use the game score directly, but with AlphaGo they used a trained value function to guide MCTS, if I'm not mistaken. They also didn't need to robustify.

[–]NubFromNubZulund 22 points23 points  (10 children)

The big difference between AlphaGo and most reinforcement learning agents for Atari is that AlphaGo assumes access to a predictive model. (Given a board state and a player action, it knows exactly what the next state will be.) In Go this assumption is just equivalent to assuming knowledge of the rules, so it’s not a big deal. However, in Atari games, assuming knowledge of the game’s precise model is a lot more “cheaty”. For this reason, most previous RL agents for Atari use model-free techniques. They’re generally weaker than methods that have access to an exact model, though this is hardly surprising. Go-Explore effectively assumes access to an exact model, so comparing its scores against those of model-free agents is pretty misleading. (And the authors’ explanation that they do so because model-free agents were the previous SOTA is disingenuous. If other researchers thought it was fair to exploit access to the exact model, there would no doubt be stronger benchmarks available by now.)

What’s irking me and many other posters is that these differences in assumptions should be made very clear upfront. Instead they lead with the fact that their algorithm achieves scores orders of magnitude higher than previous agents, and now a bunch of non-experts are celebrating. It’s a clear example of the kind of overhype pervading ML at the moment, and Uber seem to be one of the most guilty parties. If you want more evidence, check out “Welcome to the Era of Deep Neuroevolution”: https://eng.uber.com/deep-neuroevolution/?amp. The title is so hyped that it almost sounds like irony, but at least the ICLR reviewers are seeing through the hype: https://openreview.net/forum?id=HyGh4sR9YQ

[–]NichG 8 points9 points  (5 children)

Ultimately, the goal is to solve problems. The RL literature has a bad habit of making problems artificially difficult until RL is the only thing left that can solve them - for example, people still talk about cartpole when a zero-learning PID controller can solve it with zero examples.
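As an aside on the cartpole remark: a fixed PD rule on the pole angle, with no learning at all, does keep the pole balanced in a gym-style cartpole. The gains below are illustrative rather than tuned, and keeping the cart centered over very long episodes may additionally need cart-position feedback:

```python
def pd_pole_controller(obs, k_p=10.0, k_d=2.0):
    """Map a cartpole observation (x, x_dot, theta, theta_dot) to a discrete
    action (0 = push left, 1 = push right) with fixed PD feedback on the pole
    angle. No parameters are learned."""
    _, _, theta, theta_dot = obs
    force = k_p * theta + k_d * theta_dot  # positive force means push right
    return 1 if force > 0 else 0
```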

So I think it's legitimate to question that decision, and to explore variations from things that the field considers 'fair'. At least in cases where those variations correspond to practical use-cases that people might otherwise be forced to spend a much greater amount of computational or real resources to solve using a pure RL technique.

If e.g. I were wanting to develop AI opponents for quickly evolving competitive online games, then I'd be much better off knowing that given access to a simulator and the ability to render parts of the game deterministic, an exploratory branching search might save me orders of magnitude of effort compared to trying to push through with PPO or something like that.

[–]NubFromNubZulund 18 points19 points  (4 children)

You make some very good points. It’s starting to feel like there are two divergent subfields within RL. There’s the “how do humans learn” group, which cares about issues like sample efficiency (because humans clearly don’t require years’ worth of experience to learn games like Montezuma) and there’s the “how can we solve real problems” group, that doesn’t blink when firing up a thousand parallel actors or fundamentally relying on access to a simulator. For researchers in the first group, the reason for introducing artificial difficulty isn’t just to be a dick; it’s because we’re not really interested in solving Montezuma’s Revenge, but rather in building general agents that make as few assumptions as possible.

[–]TheFlowzilla 4 points5 points  (2 children)

> because humans clearly don’t require years’ worth of experience to learn games like Montezuma

Well, but they do need that. At what age would a child be able to learn Montezuma? Probably around 2-3 years old. It's just that they don't need years of experience playing Montezuma specifically; they are able to do transfer learning.

[–]NubFromNubZulund 5 points6 points  (1 child)

Fair enough, I should have said "task-specific experience". My point remains though that we definitely don't learn tasks by saving and re-loading states, and no machine that relies on this mechanism could be said to have true AGI. Your point just highlights that there's a lot more work to be done in transfer learning :)

[–]ragamufin 0 points1 point  (0 children)

Humans do save and reload states all the time in video games to learn how to complete a task.

Isn't a batting cage a great example of saving and re-loading states in the real world? You might take 5-6 pitches in a baseball game but you can take 200 in a row in a batting cage and then take that back to the larger game.

[–]ragamufin 1 point2 points  (0 children)

This is a great and concise way of explaining a divergence that has become more visible in the past year or two.

[–]rlstudent 1 point2 points  (0 children)

Yeah, I think it should be made way more clear, and I also agree that the comparisons are unfair.

But although it may be overhyped, I think the paper is an achievement in reminding us of the usefulness of planning. POLO (https://sites.google.com/view/polo-mpc) also combined planning with RL and had outstanding results (with no tricks I think... I mean, maybe the usage of MuJoCo for MPC, but MuJoCo was created for this).

[–]jclune 4 points5 points  (2 children)

As Adrien says below, "We aren't aware of any prior planning algorithm that gets anywhere near to Go-Explore's scores, even when planning in the emulator. The original ALE paper (https://arxiv.org/abs/1207.4708) tried a few planning algorithms, all of which got a flat 0 on Montezuma's." To repeat: researchers have already tried to take advantage of a perfect model (the emulator), including using MCTS, and failed on both Montezuma's Revenge and Pitfall. We thus think we are comparing to the best algorithms ever produced on this domain (which are model-free), and thus that the comparison is fair. We also think given the significant effort that has been put into trying to solve these domains (both with model-based and model-free methods), and the significant improvement in results provided by Go-Explore, it is reasonable to highlight the size of the advance to alert readers that there is an effective new technique here so they can decide whether to spend the time to read the rest of the post and learn more about how these results were achieved.

[–]NubFromNubZulund 12 points13 points  (1 child)

My problem isn't just with you over-hyping your own method, it's with you making previous deep RL methods seem weak by way of an unfair comparison. This image is particularly misleading:

https://eng.uber.com/wp-content/uploads/2018/11/mont_sota_1_header-412x420.png

The other methods in that graph are *Reinforcement Learning* algorithms. Go-Explore is not. As per Wikipedia: "The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP..." Your method effectively leverages such knowledge by saving and reloading states.

Nonetheless, your blog post opens with: "In deep reinforcement learning (RL), solving the Atari games Montezuma’s Revenge and Pitfall has been a grand challenge."

This strongly implies that your approach is a deep reinforcement learning algorithm, and that the RL community will now consider these problems solved. Neither of these are true. Go-Explore is a planning algorithm that leverages strong heuristic knowledge (provided either by a human or a grid encoding that hardly seems generalisable), coupled with a *supervised* learning algorithm. If you pitched it this way then I would have no real problem except that it seems a bit reliant on the heuristic. But it's unfair on RL researchers to claim that you've solved a grand challenge in RL when your approach isn't even a reinforcement learning algorithm (in the usual sense that everyone in the field understands). I'm certainly going to continue working on these problems.

[–]ragamufin 0 points1 point  (0 children)

That chart is comical. They break the axis to show their 2mil score but then scale the first section so that every method looks like zero except Go-Explore. This chart should never have existed.

Didn't they use something that looks like RL to produce the heuristic knowledge for Go-Explore though? It seems like a hybrid approach to me.

[–]oldmonk90 -1 points0 points  (0 children)

Good work, can't wait until the AI starts beating speed records on 3d games.

[–]crespo_modesto -4 points-3 points  (0 children)

AI solved dysentery? wow! haha /s

sorry