
[–]parzivalml 15 points16 points  (0 children)

Some good points by Alex Irpan here: https://www.alexirpan.com/2018/11/27/go-explore.html

[–]NubFromNubZulund 25 points26 points  (1 child)

The definition of a “cell” corresponding to a downsampled 11x8 screen looks like a very domain specific hack to me. Sure, it’ll work in Montezuma where the majority of change in pixel intensity is driven by Panama Joe moving around (i.e. it just becomes a proxy for his world position) but how would it go in 3d games for example?

[–]joosthuizinga 20 points21 points  (0 children)

Poorly, in all likelihood. The method of downsampling the screen does not require domain knowledge in the sense that you can apply it to all Atari games, but it probably won't work well on all Atari games. Honestly, we were quite surprised by the fact that the downsampled screen worked as well as it did on Montezuma's Revenge, as we expected that a more sophisticated state representation would have been required. Also, to add to your point, the downsampled screen did not work well on Pitfall, which is still simpler than a 3d game.

However, there has been a vast amount of literature on better, domain-agnostic state representations, and we are eager to try some of them in combination with Go-Explore to see if we can apply the algorithm to a larger variety of domains.

[–]darkconfidantislife 35 points36 points  (13 children)

Okay wait wtf, they're not using sticky actions?

EDIT: They updated and it still works well with sticky actions, see below

[–]parzivalml 19 points20 points  (0 children)

It seems they are using a deterministic environment: https://twitter.com/sherjilozair/status/1067154816936235009.

[–]jclune 2 points3 points  (1 child)

2nd Update: Go-Explore when robustified with sticky actions on Montezuma’s Revenge scores an average of 281,264 (level 18) with domain knowledge (33,836 without). On Pitfall, the average score with domain knowledge is 20,527 with a max of 64,616 (!) All SOTA. Blog updated. https://eng.uber.com/go-explore/

[–]darkconfidantislife 2 points3 points  (0 children)

Nice, thanks for the update! Go-Explore has a lot of similarities with Fractal Monte Carlo; I think it would be a good citation!

[–]AdrienLE1 2 points3 points  (0 children)

We updated our blog post to discuss this issue: http://eng.uber.com/go-explore/ (see the update at the bottom).

TL;DR: for testing with sticky actions, just testing our networks (which were not robustified with them) already puts us significantly above the state of the art. We are working on robustifying with sticky actions right now and expect our testing performance with sticky actions to get even better when that's done. We don't (yet) train with sticky actions the whole time; however, we believe that many RL applications in the foreseeable future will make use of simulators, in which case it is OK to exploit properties of simulators, such as their determinism, at some point during training, as long as the final network is robust to stochasticity, as in our own case.

[–]joosthuizinga 5 points6 points  (8 children)

There has been a lot of work on Atari that has not been tested with sticky actions, so it is not that uncommon. That said, we have no reason to believe we would not be able to robustify a policy that works with sticky actions, we just need to add it to the imitation learning algorithm.

[–]deepML_reader 17 points18 points  (2 children)

It's not about whether they can robustify the policy with sticky actions, it's about whether they could have trained the policy using sticky actions from the start.

[–]joosthuizinga 5 points6 points  (1 child)

While I am optimistic, whether or not we can deal with stochasticity, including sticky actions, at training time is a separate question that we will hopefully be able to answer in the future. For now, we show that we can make progress on really difficult problems provided that a deterministic environment is available at training time, something that has not been shown by anyone else.

[–]parzivalml 11 points12 points  (4 children)

Those works don't exploit determinism like you do. Sticky actions and stochastic frame skips were introduced specifically to prevent algorithms that exploit the deterministic nature of ALE.
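For readers unfamiliar with the protocol: "sticky actions" means the emulator repeats the agent's previous action with some probability (commonly 0.25) instead of the newly chosen one, so memorized action sequences stop replaying exactly. A minimal wrapper sketch, assuming a generic gym-style `env` object; the class and names here are illustrative, not the ALE implementation:

```python
import random

class StickyActionWrapper:
    """With probability p, execute the previously taken action instead of
    the one the agent just chose (the 'sticky actions' protocol)."""

    def __init__(self, env, p=0.25):
        self.env = env          # any object with reset()/step(action)
        self.p = p
        self.last_action = 0    # start episodes with NOOP

    def reset(self):
        self.last_action = 0
        return self.env.reset()

    def step(self, action):
        if random.random() < self.p:
            action = self.last_action  # ignore the agent's choice this frame
        self.last_action = action
        return self.env.step(action)
```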

[–]joosthuizinga 13 points14 points  (3 children)

No-ops, sticky actions, and stochastic frame skips were added to prevent DNN policies from overfitting to a deterministic environment, and they were aimed at making these policies more robust.

We also aim to train robust policies, but we explicitly separate the problems of exploration and robustification by exploring in a deterministic environment and only adding stochasticity later.

This assumes that a deterministic environment is available at training time, and this is an assumption that is not made in most previous work. However, we believe that it is an assumption that holds for many practical problems, namely all problems for which a simulator is available, meaning the algorithm has interesting practical uses.

That said, working directly on stochastic environments by using a policy to return to previously visited states is a problem that we are hoping to tackle in future work.

[–]parzivalml 5 points6 points  (0 children)

Thanks, that's a reasonable argument.

[–]Fragore 0 points1 point  (1 child)

> Poorly, in all likelihood. The method of downsampling the screen does not require domain knowledge in the sense that you can apply it to all Atari games, but it probably won't work well on all Atari games. Honestly, we were quite surprised by the fact that the downsampled screen worked as well as it did on Montezuma's Revenge, as we expected that a more sophisticated state representation would have been required. Also, to add to your point, the downsampled screen did not work well on Pitfall, which is still simpler than a 3d game.

So, at testing time, was a stochastic (with sticky actions) environment used, or was it still deterministic? Because it does not seems so from the post. So in case, as also pointed out here, you are comparing with algorithms that use it. This would mean that you are benchmarking against algorithms that solve a totally different problem. Is it right? Or am I missing something?

[–]joosthuizinga 0 points1 point  (0 children)

See my reply on your other comment/question.

[–]MrDoOO 65 points66 points  (32 children)

The algorithm is described as:

  1. Choose a cell from the archive probabilistically (optionally prefer promising ones, e.g. newer cells)
  2. Go back to that cell
  3. Explore from that cell (e.g. randomly for n steps)
  4. For all cells visited (including new cells), if the new trajectory is better (e.g. higher score), swap it in as the trajectory to reach that cell
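A minimal sketch of the loop in steps 1-4, assuming a gym-style environment plus hypothetical `get_state`/`set_state` methods for saving and restoring the emulator, and a `cell_fn` mapping an observation to a cell; selection weighting and the rest of Go-Explore's bookkeeping are omitted:

```python
import random

def explore_phase(env, cell_fn, iterations=1000, n_steps=100):
    """Archive maps cell -> (saved emulator state, trajectory, score)."""
    obs = env.reset()
    archive = {cell_fn(obs): (env.get_state(), [], 0.0)}

    for _ in range(iterations):
        # 1. Choose a cell from the archive (uniform here; Go-Explore weights
        #    selection toward promising cells, e.g. rarely visited ones).
        cell = random.choice(list(archive))
        state, trajectory, score = archive[cell]

        # 2. Go back to that cell by restoring the saved emulator state.
        env.set_state(state)
        traj, total = list(trajectory), score

        # 3. Explore from it, e.g. with random actions for n_steps.
        for _ in range(n_steps):
            action = env.action_space.sample()
            obs, reward, done, _ = env.step(action)
            traj.append(action)
            total += reward

            # 4. For every cell visited, keep the best trajectory found so far.
            c = cell_fn(obs)
            if c not in archive or total > archive[c][2]:
                archive[c] = (env.get_state(), list(traj), total)
            if done:
                break
    return archive
```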

This is exhaustive search. Also, "Returning to cells" is not trivial by any stretch in environments with any amount of stochasticity... The authors say that in stochastic environments "one can train a goal-conditioned policy [1, 10] that learns to reliably return to a cell". It seems totally intractable to train and maintain separate neural nets for each cell that you want to ever "return to". Am I missing something, or is this algorithm akin to an exhaustive search with the additional overhead of insane memory requirements and training time?

Yet another silly RL paper that introduces an extremely brittle and domain specific algorithm to be able to say look we "win" on something. The RL community has got to be better than this if we want RL to work in practice.

[–]gwern 14 points15 points  (18 children)

Does exhaustive search do well on MR? I don't recall anyone showing that, and I would expect that to be intractable.

> It seems totally intractable to train and maintain separate neural nets for each cell that you want to ever "return to"

It would be goal-conditioned, so presumably only one NN is trained, but is parameterized to take in the 'cell' state representation and is rewarded only when it reaches that.

Still, it's pretty crazy that all they have to do to define 'cells' is... a tabular representation of downscaled images:

> To be tractable in high-dimensional state spaces like Atari, Go-Explore needs a lower-dimensional cell representation with which to form its archive. Thus, the cell representation should conflate states that are similar enough to not be worth exploring separately (and not conflate states that are meaningfully different). Importantly, we show that game-specific domain knowledge is not required to create such a representation. We found that what is perhaps the most naive possible cell representation worked pretty well: simply downsampling the current game frame... In particular, we tested a domain-knowledge version of Go-Explore on Montezuma’s Revenge, wherein cells were defined as unique combinations of the x-y position of the agent, the current room, the current level, and the current number of keys held. We wrote simple code to extract this information directly from pixels...
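For concreteness, a sketch of the kind of downsampled-frame cell representation quoted above; the 11x8 grid and 8 intensity levels match the numbers discussed in this thread, but treat them as illustrative parameters rather than the exact pipeline:

```python
import numpy as np

def frame_to_cell(frame, width=11, height=8, n_intensities=8):
    """Map a grayscale Atari frame (e.g. 210x160) to a small hashable tuple;
    frames that map to the same tuple are treated as the same 'cell'."""
    h, w = frame.shape
    # Crop so the frame divides evenly, then average-pool down to height x width.
    small = frame[: (h // height) * height, : (w // width) * width]
    small = small.reshape(height, h // height, width, w // width).mean(axis=(1, 3))
    # Quantize pixel intensities into n_intensities buckets.
    quantized = (small / 256.0 * n_intensities).astype(np.uint8)
    return tuple(quantized.flatten())
```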

> Robustifying the trajectories found with the domain knowledge version of Go-Explore produces deep neural network policies that reliably solve the first 3 levels of Montezuma’s Revenge (and are robust to random numbers of initial no-ops). Because in this game all levels beyond level 3 are nearly identical (as described above), Go-Explore has solved the entire game! In fact, our agents generalize beyond their initial trajectories, on average solving 29 levels and achieving a score of 469,209!...Go-Explore’s max score is substantially higher than the human world record of 1,219,200, achieving even the strictest definition of “superhuman performance.”

The resource usage will be very interesting. EDIT:

> All runs were done on single machines each with 22 CPUs and 50GB of RAM

Hah, that's less RAM than I have on this machine. Not bad.

[–]svantana 1 point2 points  (0 children)

> All runs were done on single machines each with 22 CPUs and 50GB of RAM

> Hah, that's less RAM than I have on this machine. Not bad.

Well, given that the entire game fits in a 6 kb file [1], it's not that parsimonious...

[1] http://www.atarimania.com/game-atari-2600-vcs-montezuma-s-revenge_7986.html

[–]MrDoOO 1 point2 points  (16 children)

What separates this from exhaustive search? Run a random agent, visit some states, sample promising trajectories, move randomly from those, repeat. Also, being able to reset the state of the simulator to a specific frame is a total cheat that makes the benchmark lose significance entirely...

[–]gwern 7 points8 points  (14 children)

Any exploration strategy sounds worthless and like 'exhaustive search' when you phrase it like that. How do you decide what 'states' to pick? How do you sample promising trajectories? How do you learn from your random moves? It's not exhaustive search because if you try to brute force every possible trajectory in MR, it's probably not going to end well, for the same reason that epsilon-greedy doesn't work.

And it's only a cheat if they can't actually train a NN to reliably reach the goal; if they can, it's just an optimization, as they say, and is no more interesting than, say, OA5 using the API rather than learning from pixels.

[–]deepML_reader 8 points9 points  (6 children)

They have claimed SOTA on this task and so the burden is on them to show that they can actually do it using the same restrictions everyone else uses, rather than us giving them the benefit of the doubt. Having a network that can achieve any specified state in montezuma or pitfall would be amazing and non-trivial.

Whether or not they succeed at making this goal-achieving agent in the future doesn't change the fact that their strong claim is wrong right now, what they have right now is a belief that they can produce a SOTA model.

[–]joosthuizinga 9 points10 points  (5 children)

It is common to have different variants of a particular problem where different assumptions are made. We claim SOTA under the assumption that a deterministic version of the environment is available. We compare against others who do not make this assumption because they hold the current state-of-the-art regardless of whether a deterministic version of the environment is available. To my knowledge, there are currently no other algorithms that obtain higher scores than those reported by us by assuming that a deterministic version of the environment is available during training time.

[–]Fragore 1 point2 points  (3 children)

> It is common to have different variants of a particular problem where different assumptions are made. We claim SOTA under the assumption that a deterministic version of the environment is available. We compare against others who do not make this assumption because they hold the current state-of-the-art regardless of whether a deterministic version of the environment is available. To my knowledge, there are currently no other algorithms that obtain higher scores than those reported by us by assuming that a deterministic version of the environment is available during training time.

But this means that you are comparing against algorithms that solve a completely different problem (fully stochastic vs. fully deterministic). The benchmarking does not make much sense under these assumptions.

[–]joosthuizinga 4 points5 points  (1 child)

There are two issues, and I will address them separately.

The first issue is that we use a random number of no-ops at the start, rather than sticky actions, to evaluate the robustness of our policies after training (which includes both the exploration phase and the robustification phase). Using no-ops to evaluate robustness has been the standard for a long time, which is why we chose it, but we agree that sticky actions are a better test for policy robustness, and we are currently retraining our policies with sticky actions to make the comparison more fair. Thus far, we have tested our policies with sticky actions (but without further training), and we obtain a score of 19,540 (Montezuma's Revenge without domain knowledge), down from 35,410.

The second issue is that we assume access to a deterministic version of the world at training time. The general idea is that the Atari games are a proxy for difficult, real-world problems like robot control, and the deterministic version of the game represents a simulator of that real-world problem. We believe that such a simulator is available for many real-world problems, but we also acknowledge that there are plenty of problems where such a simulator is not readily available.

We show results on the version of the problem where a simulator is available, but we believe we have to compare against previous state-of-the-art that trains directly in a stochastic environment, because a technique that can be applied directly in a stochastic environment can obviously also be applied in cases where a deterministic simulator is available.

With that said, we have not been very clear in explaining either of these points in the blog post, and we are working on clarifying them.

[–]Fragore 1 point2 points  (0 children)

Thanks for the clarification!

[–]jclune 0 points1 point  (0 children)

2nd Update: Go-Explore when robustified with sticky actions on Montezuma’s Revenge scores an average of 281,264 (level 18) with domain knowledge (33,836 without). On Pitfall, the average score with domain knowledge is 20,527 with a max of 64,616 (!) All SOTA. Blog updated. https://eng.uber.com/go-explore/

[–]MrDoOO 9 points10 points  (6 children)

This work literally solves an easy version of the problem with exhaustive search and then uses imitation learning to match the policy. The difficulty in RL is learning the first policy to begin with. Hacking a simulator to make the problem easy is just a cheat. What this paper shows is that imitation learning works...which we already knew.

[–]gwern 10 points11 points  (4 children)

> This work literally solves an easy version of the problem with exhaustive search and then uses imitation learning to match the policy.

It's not exhaustive search if they don't 'exhaust' the search space. Which they probably don't. And you haven't given any examples of someone solving this 'easy version' of MR before and I am skeptical that MR has such a small game tree it can be brute-forced in the way that you claim it can. Does even MCTS on the emulator RAM work? I don't recall Guo getting any MR results worth remembering... (And in what sense are DRL agents not always engaged in 'imitation learning' of successful trajectories, one wonders.)

Again, if you can in fact train a NN to reliably reach a specified goal, why is this approach of learning interesting points, using an option to travel to it and resuming exploration, then learning from gathered experiences, any less legitimate than, say, Neural Episodic Control?

[–]MrDoOO 9 points10 points  (3 children)

[–]gwern 12 points13 points  (2 children)

They don't explain it very well, nor do they back up your claims. Again, if you can train a NN to reach specified states, why is this illegitimate? If it's so easy to solve with basic planning, why has no one done it? Why don't MCTS and other planning approaches work? Are world models/deep environment models illegitimate and useless for planning if they aren't inherently stochastic? These are not hard questions for you to answer.

[–]sherjilozair 8 points9 points  (0 children)

> Again, if you can train a NN to reach specified states, why is this illegitimate?

Because this is not easy to do, in terms of sample efficiency. In this context, the states we care to reach are novel states, which by definition we only have a few (probably one) examples of. Goal-conditional policies would require more than that to reliably reach the goal. The data requirements to train such a goal-conditional policy could blow up, and the burden of proof that it does not rests on the authors.

MCTS is a reward-dependent planning scheme, and also breadth-first in spirit, and thus quite unsurprisingly didn't work for MZR. I do believe depth-first search with some heuristics should be able to score as much. I am planning to try this in the near future.

EDIT: I realize that go-explore essentially is depth-first search with heuristics (state abstraction via downsampling).

[–]thebackpropaganda 1 point2 points  (0 children)

> Are world models/deep environment models illegitimate and useless for planning if they aren't inherently stochastic?

Yes, unless the environment is nearly-deterministic.

This is why researchers are working on stochastic world models, which do indeed work much better than deterministic models; see e.g. CGQN and the citations within.

[–]joosthuizinga 9 points10 points  (0 children)

Yes, we definitely solve an easier version of the problem first, and then use imitation learning to create a policy; this is one important insight that we want to put forward in this blog post. And while it may seem that making the simulator deterministic is a cheat, it is important to note that this is a cheat which is available in almost every single simulator (long live pseudo-random number generators) meaning it is practically applicable. In addition, to the best of my knowledge, nobody has managed to solve the easy version of the problem.

As for being exhaustive search, Go-Explore definitely resembles a form of exhaustive search. However, an exhaustive search in the original state-space (i.e. every single combination of the player, the hazards, and the items at different locations in the game) is intractable. As such, one primary insight is to conflate the search space by mapping many different states to the same cell.

Now you get a different problem: it becomes unclear which actions to take to get from one part of the reduced search space to the other. In our current implementation, it turns out that just taking random actions is usually sufficient to discover nearby cells, though it is expected that other methods for state conflation will require more intelligent exploration strategies to get from one cell to the next.

[–]ragamufin 0 points1 point  (0 children)

It's not a cheat; those are transient states. The point of the model is to solve the problem efficiently, not to accurately simulate actions in a video game.

[–]AdrienLE1 30 points31 points  (8 children)

> Am I missing something, or is this algorithm akin to an exhaustive search with the additional overhead of insane memory requirements and training time?

I gave details for training time and memory requirements in these tweets: https://twitter.com/AdrienLE/status/1067176312945631233. I think they are far from insane (in fact they are much lower than those of most other RL algorithms).

"Exhaustive search" is hard to define, but as far as we know, no planning algorithm working directly in the emulator (in fact, no algorithm ever) had reached anywhere near these scores on Montezuma's. To quote another comment I gave here: "The original ALE paper (https://arxiv.org/abs/1207.4708) tried a few planning algorithms, all of which got a flat 0 on Montezuma's."

> Yet another silly RL paper that introduces an extremely brittle and domain specific algorithm to be able to say look we "win" on something

We do generate brittle trajectories during training, but then robustify them using imitation learning and obtain robust policies.
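For illustration, the simplest possible version of that robustification step is behavioral cloning on (observation, action) pairs from the best trajectories. The sketch below is that stand-in, not the learning-from-demonstration algorithm the blog post actually describes, and the shapes and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

def robustify_by_cloning(trajectories, obs_dim, n_actions, epochs=10, lr=1e-3):
    """Behavioral cloning: fit a policy to (observation, action) pairs taken
    from the exploration-phase trajectories. Each trajectory is assumed to be
    a list of (obs_vector, action) pairs."""
    policy = nn.Sequential(
        nn.Linear(obs_dim, 256), nn.ReLU(),
        nn.Linear(256, n_actions),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    obs = torch.tensor([o for traj in trajectories for o, _ in traj],
                       dtype=torch.float32)
    acts = torch.tensor([a for traj in trajectories for _, a in traj],
                        dtype=torch.long)

    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(obs), acts)  # maximize log-prob of demo actions
        loss.backward()
        opt.step()
    return policy
```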

The only real requirement of the current version, which indeed relies on determinism for phase 1 and only becomes robust in phase 2, is to train in a simulator, since any computer-based simulator can be made deterministic by setting its random seed. This means you can't use this version of the algorithm if you want to learn 100% in the real world, but then again learning in the real world is also often impractical with conventional RL algorithms due to their huge sample complexity, which is why most people use simulators.

[–]guicho271828 18 points19 points  (3 children)

Optimization & graph search expert here. This approach is Frontier Search (Korf 1999 IJCAI) or RBFS (Korf 1993, Artificial Intelligence) or somewhere in the middle, combined with RL.

It is wrong to say it can always go back to the cell, even in a deterministic setting. Consider a non-replenishable resource like fuel. At some point it becomes impossible to go back to a certain cell due to the environmental constraint.

Edit: So, about "Exhaustive search", the more appropriate, established term for it is "systematic search". Since the algorithm actually presented is stochastic and randomized, as well as seemingly having only a fixed amount of cell memory, it does not guarantee completeness and is not fully systematic. Completeness = the ability to find a solution in finite time whenever it is reachable, under a deterministic setting -- in a nondeterministic setting, this is replaced by probabilistic completeness.

[–]rhaps0dy4 14 points15 points  (0 children)

Montezuma’s Revenge has always been easier with search. Resetting the environment is very powerful. In 2015, just before DQN came around, Lipovetzky, Ramírez and Geffner had already shown a score of 540 using their domain-independent Iterated Width algorithm.

For my bachelor thesis in 2016, by cheating, I greatly improved this to 14,900 using a few domain-specific heuristics. Namely, don't explore the same position in the same screen twice (except if you're stuck dying continuously, to allow the algorithm to wait for obstacles that are only passable in intervals), and penalise opening some of the doors (the algorithm was too myopic and wasted keys on them). This run reached the maximum score possible without finishing the first level. Here's a video.

[–]yazriel0 1 point2 points  (1 child)

Can you recommend a modern survey (or practical guide!) for these techniques?

My problem domain is really just a weird and specific find-optimal-sub-graphs type search. I got caught up in the AlphaGo hype and now we have MCTS + neural net heuristic + plenty of CPU + manual features. It sort of works. But I never really tested more traditional algorithms.

[–]NubFromNubZulund 12 points13 points  (2 children)

Actually, a second BIG requirement is that the stochasticity in the environment is relatively minor (as it is with sticky actions). Imagine you’re training a poker bot to play heads up against a computer opponent. The cards dealt to each player and the cards that come down on the flop, etc are determined by the random seed. If you try your approach on this game, it will just memorise which hands are winning; it won’t actually learn how to beat the computer with proper poker skills.

[–]joosthuizinga 5 points6 points  (1 child)

You are right, if there are certain states which can only be reached with some random seeds, but not others, Go-Explore may find trajectories to those states, but an imitation learning algorithm will not be able to reliably revisit those states because the environment itself prevents you from visiting those states reliably.

One possible strategy for dealing with environments like that is to run the exploration phase multiple times with different seeds. This way, you may be able to gather enough trajectories such that you have good examples of what to do in very different scenarios.

That said, I don't think poker is the best domain for an algorithm like Go-Explore.

[–]NichG 2 points3 points  (0 children)

You could just add a filter such that states which prove difficult to revisit via imitation are removed from the archive.

This should also partially resolve training in inherently stochastic environments as well, since you could basically explore around the 'reliably repeatable' subpart of the problem space.
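A sketch of that filtering idea, assuming each archive entry carries hypothetical counters for how often the return policy attempted to reach the cell and how often it succeeded:

```python
def prune_unreliable_cells(archive, min_success_rate=0.5, min_attempts=5):
    """Drop archived cells that the return policy keeps failing to reach.
    `entry.attempts` / `entry.successes` are hypothetical bookkeeping fields."""
    return {
        cell: entry
        for cell, entry in archive.items()
        if entry.attempts < min_attempts  # not enough evidence yet, keep it
        or entry.successes / entry.attempts >= min_success_rate
    }
```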

[–]jedi-son 0 points1 point  (0 children)

> You are right, if there are certain states which can only be reached with some random seeds, but not others, Go-Explore may find trajectories to those states, but an imitation learning algorithm will not be able to reliably revisit those states because the environment itself prevents you from visiting those states reliably.

Similar question to the others but let me word it more specifically: Does this algorithm maintain performance on never before seen levels (ie out of sample)?

[–]Antonenanenas 5 points6 points  (3 children)

I think you are missing something: you do not need separate neural nets for each cell that you want to return to. Using Hindsight Experience Replay (HER) you can train one policy that can be used for any goal. You simply feed the goal as an additional input to the network and apply some training tricks to make it work.
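A minimal sketch of that goal-conditioned setup (UVFA/HER style): one network, with the target cell representation concatenated onto the observation. Dimensions and architecture here are placeholders:

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """One policy for every goal: the target cell is just an extra input."""

    def __init__(self, obs_dim, goal_dim, n_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, goal):
        # Condition on the goal by concatenation; during training, reward is
        # given only when the state matching `goal` is actually reached.
        return self.net(torch.cat([obs, goal], dim=-1))
```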

The memory requirements are not necessarily "insane", as they downsample the input image drastically. But I agree, the state space might still be large and it is bad that we do not have any info on how time and memory efficient this algorithm is.

This algorithm might be a bit brittle and domain specific as of now, but don't you think that this can still count as a proof of concept to build future work on?

[–]seann999 2 points3 points  (0 children)

Also (earlier work): UVFAs

[–]AdrienLE1 1 point2 points  (0 children)

Thanks, I forgot to address this. Yes, this is absolutely true: this would likely use a single goal-conditioned neural net policy that takes in the target cell as input, not a plethora of neural nets.

Indeed, there is a lot of prior work on this with HER and UVFAs.

[–][deleted] 25 points26 points  (6 children)

It is nonsensical to say *we will add noise later* in an RL problem

[–]seann999 13 points14 points  (3 children)

Here's a similar thread https://twitter.com/tejasdkulkarni/status/1067136994344620033

I do smell something kinda fishy overall, but we're gonna need a paper with more details than a blog post.

[–]MrDoOO 15 points16 points  (2 children)

"Here we realized that [stochasticity] hurts exploration, so we can first solve a deterministic version THEN robustify. That's much better" - Author

I don't think the author realizes that you can't turn off stochasticity on real domains. Color me shocked that solving hard domains like MR becomes easy when you turn off stochasticity...

[–]AdrienLE1 24 points25 points  (0 children)

In robotics (which I would argue is a real domain), it is common to do the first phase of training in a simulator, where stochasticity can be turned off. Of course, you eventually need to become robust to stochasticity, but the second phase of Go-Explore does this successfully.

[–]probablyuntrue 12 points13 points  (0 children)

> if we ignore one of the chief problems of the field we can beat SOTA

congrats to the authors I guess lol

[–]joosthuizinga 7 points8 points  (1 child)

Maybe when you consider the space of all possible RL problems.

However, when it comes to RL applications like robotics or video games, we almost always have some kind of computational simulation, and this computational simulation can be made deterministic by choosing one particular random seed first, and noise can be added later by choosing different random seeds. The resulting two phase problem may no longer be an RL problem in the strictest sense of the word, but it is a problem that has lots of practical applications if you can solve it effectively.
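As a toy illustration of that two-phase use of seeds (`make_env`, `explore`, and `train_on` are hypothetical stand-ins, not a specific simulator API):

```python
def exploration_phase(make_env, explore):
    # Phase 1: a single fixed seed makes the simulator deterministic, so
    # stored action sequences reliably reproduce the same states.
    env = make_env(seed=0)
    return explore(env)

def robustification_phase(make_env, train_on, trajectories, n_seeds=100):
    # Phase 2: reintroduce noise by training/evaluating across many seeds
    # (and/or sticky actions, random no-ops, etc.).
    envs = [make_env(seed=s) for s in range(1, n_seeds + 1)]
    return train_on(envs, trajectories)
```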

[–][deleted] 8 points9 points  (0 children)

This is a simulator trick and not a progression of RL state of the art

[–]outlacedev 5 points6 points  (1 child)

It looks like only the "domain knowledge" versions of the algorithm are significantly better than SOTA. Domain knowledge meaning they hard-coded a script to extract features from the states. It seems by manually extracting the features they've sort of turned Montezuma's Revenge into Gridworld.

[–]dod_worker 1 point2 points  (0 children)

The authors are basically using "domain knowledge" as a euphemism for "hand-crafted feature extraction", which is potentially their biggest issue. The blog post would not be nearly as interesting if the authors just stated bluntly: "we used hand-crafted features for MR and got better performance".

Anytime you use hand-crafted features, you are of course going to get much better performance.

[–]michael-relleum 3 points4 points  (16 children)

Interesting. I've seen Fractal AI around for quite some time, but always understood it as "just" a very sophisticated Monte Carlo search. How does this compare with the new RND from OpenAI? Which one trains faster / gets a higher score?

Edit: Never mind, this has nothing to do with Fractal AI. Still, my question stands: is this faster / does it get higher scores than Random Network Distillation? (I mean for environments other than Montezuma and Pitfall, for example Pac-Man or Breakout.)

[–]yazriel0 1 point2 points  (1 child)

They seem to use (and cite) an RND network in the "imitation learning" phase of the algorithm.

But I agree that we want to compare the "exploration" phase to existing MCTS variants.

[–]AdrienLE1 5 points6 points  (0 children)

We aren't aware of any prior planning algorithm that gets anywhere near to Go-Explore's scores, even when planning in the emulator. The original ALE paper (https://arxiv.org/abs/1207.4708) tried a few planning algorithms, all of which got a flat 0 on Montezuma's.

[–]Antonenanenas 1 point2 points  (1 child)

Yes, it gets higher scores; it says so in the blog. RND gets about 15k points, whereas this approach without domain-specific knowledge gets about 36k. Look at the figure where the x-axis is the year and the y-axis is the score.

[–]michael-relleum 5 points6 points  (0 children)

Yes, for Montezuma, but how well does this work for other environments? Is this a general advance in RL (like PPO was), or is it just exhaustive search, made to measure for Montezuma and Pitfall?

[–]AdrienLE1 3 points4 points  (1 child)

Random Network Distillation is the "RND" point in our comparisons to SOTA (see the graphs in the blog post). The answer is that RND gets around 11,500 points on average, whereas our non-domain knowledge version gets over 35,000 and our domain knowledge version gets over 450,000, so yes, this way outperforms it.

As far as we know, Go-Explore beats all current RL algorithms on Montezuma's Revenge.

[–]NubFromNubZulund 1 point2 points  (0 children)

How many emulator steps does it require though if you don’t use state loading?

These latest papers on “sota” results in Montezuma’s Revenge both make very little mention of sample complexity. The top RND score used 16B frames of experience (100x more than previous sota agents) and I’m guessing yours takes even more if you actually took the steps required to return to a state into account.

Surely the main aim of exploration methods is sample efficiency!

[–]darkconfidantislife 0 points1 point  (9 children)

To clarify, if you look at what "fractal monte carlo" does, it's quite similar to this. I'm not saying they're related, sorry for the confusion

For example, choose some state with random exploration and then clone to it if it is better, done in both this and "fractal monte carlo"

[–]michael-relleum 0 points1 point  (8 children)

Oh, I know. But wasn't the "problem" with FAI that it didn't really learn anything per se, it just tried out hundreds of variations each time and chose the best one? Meaning it worked fast the first time but had to do the calculation again and again on every run? Is this different for Go-Explore?

[–]darkconfidantislife 2 points3 points  (7 children)

Is Go-Explore actually doing any learning?

EDIT: actually looks like they are with using deep nets for imitation learning of the generated trajectory paths. So basically a lot like FAI + imitation learning plus some domain knowledge for PR purposes for their 2M score version

[–][deleted]  (6 children)

[deleted]

[–]tihokan 2 points3 points  (4 children)

> we are not planning to publish that work so we can avoid more "funny coincidences" like this one

A more positive attitude would be to publish your work to inspire others, without worrying too much over people potentially re-using your ideas without proper credit. Most ideas keep being re-discovered independently anyway, so holding on to them is usually the best way to get scooped.

[–][deleted]  (3 children)

[deleted]

[–]tihokan 2 points3 points  (2 children)

I actually read about Fractal AI first through the Fractal AI recipe blog post. Unfortunately I found it rather unclear and it basically sounded to me like some variant of Monte-Carlo sampling requiring the ability to easily reset the state, with no learning ability. My feeling was that it might have some potential but I’d wait to see if anything convincing enough came out of it before trying to really understand it.

I honestly can’t tell whether this stuff is as exciting as you make it sound, but there’s one thing I know for sure: if it is, then you failed to communicate it properly to the RL research community, as otherwise it would be a lot more popular. It sounds like you’ve already got some peer reviews and are familiar with the relevant RL literature, so I guess you should have some ideas on how to better present it, but in case you want some extra advice feel free to send me your latest version (I’m used to reviewing RL papers).

And btw thanks for sharing that doc, keeping track of Atari results has always been a pita :/

[–]miau-db 2 points3 points  (1 child)

You are right, and that kind of healthy skepticism is needed to survive the huge amount of literature published about RL lately.

FAI is nice, but it is only a tool. I actually think that what is extraordinary, and has the potential to change how people train RL agents, is the approach of combining planning and DL (how the planning is performed is just a minor detail).

The Go-Explore algorithm is amazing enough on its own (as you can check, they crush all the literature we found). Trying to overhype it will just end up making people think that Go-Explore is another "Microsoft's Pacman", when they are in fact offering a different approach to training DNNs that solve decision problems.

Sadly, the hype is something that may influence how seriously people take this kind of work. I would love to see more researchers working on the topic, and the Uber guys have something that can be "convincing enough" to make people care about "planning + DL". I just hope that they make the effort to do science right, so people like you feel motivated to start "hacking the environment" with planning algorithms.

[–]tihokan 0 points1 point  (0 children)

Oh I'm all for combining planning + DL for RL, and I doubt there are many skeptics left regarding the potential of such a combination, after seeing AlphaGo/Zero results. I also think there's a healthy amount of people working on model-based RL, though progress has been slow due to how hard it is to build reliable models of the environment. And I very much like the idea of leveraging the specific properties of simulators (vs. the real world) to improve on these, as long as the resulting limitations are clearly stated and there are demonstrated benefits on practical applications (regarding Uber's work, I'm waiting for the paper to be available to judge whether this is the case).

[–]darkconfidantislife 1 point2 points  (0 children)

Hey, Fractal AI looks really cool, would love to talk, PM me?

[–]darkconfidantislife 2 points3 points  (0 children)

By the way, this seems to have a lot of similarities to "fractal monte carlo": https://arxiv.org/pdf/1807.01081.pdf

http://entropicai.blogspot.com/2018/03/fractal-ai-recipe.html


[–]question99 2 points3 points  (0 children)

Emphasis added by me:

> Robustifying the trajectories found with the domain knowledge version of Go-Explore produces deep neural network policies that reliably solve the first 3 levels of Montezuma’s Revenge (and are robust to random numbers of initial no-ops). Because in this game all levels beyond level 3 are nearly identical (as described above), Go-Explore has solved the entire game!

Now the question is: could this technique be improved to automatically infer what was encoded as domain knowledge?

[–]unguided_deepness 2 points3 points  (0 children)

The finding seems pretty obvious to me. When you play/speedrun a difficult game, you save right before a difficult sequence, and if you fail, you restart from that save state. Once you master all the difficult sequences, you try to play the game from the beginning again and try to optimize it.

[–]kaledivergence 2 points3 points  (0 children)

Uber just released a note on the issue of stochasticity at the bottom of the post.

[–]yazriel0 2 points3 points  (2 children)

Setting aside the concerns about the generality and reset-to-state aspects:

Is this algorithm a major improvement with respect to CPU resources? There is no back-prop, so no GPU, and some good results with just 10^8 frames on a hard-exploration Atari game.

[–]AdrienLE1 5 points6 points  (0 children)

I gave a bit of data about resources in these tweets: https://twitter.com/AdrienLE/status/1067176312945631233, more data is coming in the paper.

Paradoxically, the more computationally intensive part currently is the imitation learning in phase 2, but we think this is largely due to the algorithm we happened to choose (in principle, any imitation learning algorithm that can learn from trajectories would be usable for phase 2).

[–]deepML_reader 0 points1 point  (0 children)

Does the 10^8 number take into account the number of actions you have to replay to get back to the same state?

[–]rlstudent 2 points3 points  (11 children)

I really liked the idea.

I don't think it's a "hack" or anything like that. It works on any deterministic (or "deterministicfiable") environment.

I should read the AlphaGo paper, but from what I remember reading about it, didn't they use a similar approach? Phase 1 looks like Monte Carlo Tree Search to me, is it different? In this one you can use the game score directly, but with AlphaGo they used a trained value function to guide MCTS, if I'm not mistaken. They also didn't need to robustify.

[–]NubFromNubZulund 22 points23 points  (10 children)

The big difference between AlphaGo and most reinforcement learning agents for Atari is that AlphaGo assumes access to a predictive model. (Given a board state and a player action, it knows exactly what the next state will be.) In Go this assumption is just equivalent to assuming knowledge of the rules, so it’s not a big deal. However, in Atari games, assuming knowledge of the game’s precise model is a lot more “cheaty”. For this reason, most previous RL agents for Atari use model-free techniques. They’re generally weaker than methods that have access to an exact model, though this is hardly surprising. Go-Explore effectively assumes access to an exact model, so comparing its scores against those of model-free agents is pretty misleading. (And the authors’ explanation that they do so because model-free agents were the previous SOTA is disingenuous. If other researchers thought it was fair to exploit access to the exact model, there would no doubt be stronger benchmarks available by now.)

What’s irking me and many other posters is that these differences in assumptions should be made very clear upfront. Instead they lead with the fact that their algorithm achieves scores orders of magnitude higher than previous agents, and now a bunch of non-experts are celebrating. It’s a clear example of the kind of overhype pervading ML at the moment, and Uber seem to be one of the most guilty parties. If you want more evidence, check out “Welcome to the Era of Deep Neuroevolution”: https://eng.uber.com/deep-neuroevolution/?amp. The title is so hyped that it almost sounds like irony, but at least the ICLR reviewers are seeing through the hype: https://openreview.net/forum?id=HyGh4sR9YQ

[–]NichG 8 points9 points  (5 children)

Ultimately, the goal is to solve problems. The RL literature has a bad habit of making problems artificially difficult until RL is the only thing left that can solve them - for example, people still talk about cartpole when a zero-learning PID controller can solve it with zero examples.
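As an aside on the cartpole remark: a fixed PD rule on the pole angle, with no learning at all, does keep the pole balanced in a gym-style cartpole. The gains below are illustrative rather than tuned, and keeping the cart centered over very long episodes may additionally need cart-position feedback:

```python
def pd_pole_controller(obs, k_p=10.0, k_d=2.0):
    """Map a cartpole observation (x, x_dot, theta, theta_dot) to a discrete
    action (0 = push left, 1 = push right) with fixed PD feedback on the pole
    angle. No parameters are learned."""
    _, _, theta, theta_dot = obs
    force = k_p * theta + k_d * theta_dot  # positive force means push right
    return 1 if force > 0 else 0
```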

So I think it's legitimate to question that decision, and to explore variations from things that the field considers 'fair'. At least in cases where those variations correspond to practical use-cases that people might otherwise be forced to spend a much greater amount of computational or real resources to solve using a pure RL technique.

If e.g. I were wanting to develop AI opponents for quickly evolving competitive online games, then I'd be much better off knowing that given access to a simulator and the ability to render parts of the game deterministic, an exploratory branching search might save me orders of magnitude of effort compared to trying to push through with PPO or something like that.

[–]NubFromNubZulund 18 points19 points  (4 children)

You make some very good points. It’s starting to feel like there are two divergent subfields within RL. There’s the “how do humans learn” group, which cares about issues like sample efficiency (because humans clearly don’t require years’ worth of experience to learn games like Montezuma) and there’s the “how can we solve real problems” group, that doesn’t blink when firing up a thousand parallel actors or fundamentally relying on access to a simulator. For researchers in the first group, the reason for introducing artificial difficulty isn’t just to be a dick; it’s because we’re not really interested in solving Montezuma’s Revenge, but rather in building general agents that make as few assumptions as possible.

[–]TheFlowzilla 4 points5 points  (2 children)

> because humans clearly don’t require years’ worth of experience to learn games like Montezuma

Well, but they do need that. At what age would a child be able to learn Montezuma? Probably around 2-3 years old. It's just that they don't need years of experience playing Montezuma specifically; they are able to do transfer learning.

[–]NubFromNubZulund 5 points6 points  (1 child)

Fair enough, I should have said "task-specific experience". My point remains though that we definitely don't learn tasks by saving and re-loading states, and no machine that relies on this mechanism could be said to have true AGI. Your point just highlights that there's a lot more work to be done in transfer learning :)

[–]ragamufin 0 points1 point  (0 children)

Humans do save and reload states all the time in video games to learn how to complete a task.

Isn't a batting cage a great example of saving and re-loading states in the real world? You might take 5-6 pitches in a baseball game but you can take 200 in a row in a batting cage and then take that back to the larger game.

[–]ragamufin 1 point2 points  (0 children)

This is a great and concise way of explaining a divergence that has become more visible in the past year or two.

[–]rlstudent 1 point2 points  (0 children)

Yeah, I think it should be made way more clear, and I also agree that the comparisons are unfair.

But although it may be overhyped, I think the paper is an achievement in reminding us of the usefulness of planning. POLO (https://sites.google.com/view/polo-mpc) also combined planning with RL and had outstanding results (with no tricks I think... I mean, maybe the usage of MuJoCo for MPC, but MuJoCo was created for this).

[–]jclune 4 points5 points  (2 children)

As Adrien says below, "We aren't aware of any prior planning algorithm that gets anywhere near to Go-Explore's scores, even when planning in the emulator. The original ALE paper (https://arxiv.org/abs/1207.4708) tried a few planning algorithms, all of which got a flat 0 on Montezuma's." To repeat: researchers have already tried to take advantage of a perfect model (the emulator), including using MCTS, and failed on both Montezuma's Revenge and Pitfall. We thus think we are comparing to the best algorithms ever produced on this domain (which are model-free), and thus that the comparison is fair. We also think given the significant effort that has been put into trying to solve these domains (both with model-based and model-free methods), and the significant improvement in results provided by Go-Explore, it is reasonable to highlight the size of the advance to alert readers that there is an effective new technique here so they can decide whether to spend the time to read the rest of the post and learn more about how these results were achieved.

[–]NubFromNubZulund 12 points13 points  (1 child)

My problem isn't just with you over-hyping your own method, it's with you making previous deep RL methods seem weak by way of an unfair comparison. This image is particularly misleading:

https://eng.uber.com/wp-content/uploads/2018/11/mont_sota_1_header-412x420.png

The other methods in that graph are *Reinforcement Learning* algorithms. Go-Explore is not. As per Wikipedia: "The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP..." Your method effectively leverages such knowledge by saving and reloading states.

Nonetheless, your blog post opens with: "In deep reinforcement learning (RL), solving the Atari games Montezuma’s Revenge and Pitfall has been a grand challenge."

This strongly implies that your approach is a deep reinforcement learning algorithm, and that the RL community will now consider these problems solved. Neither of these are true. Go-Explore is a planning algorithm that leverages strong heuristic knowledge (provided either by a human or a grid encoding that hardly seems generalisable), coupled with a *supervised* learning algorithm. If you pitched it this way then I would have no real problem except that it seems a bit reliant on the heuristic. But it's unfair on RL researchers to claim that you've solved a grand challenge in RL when your approach isn't even a reinforcement learning algorithm (in the usual sense that everyone in the field understands). I'm certainly going to continue working on these problems.

[–]ragamufin 0 points1 point  (0 children)

That chart is comical. They break the axis to show their 2mil score but then scale the first section so that every method looks like zero except Go-Explore. This chart should never have existed.

Didn't they use something that looks like RL to produce the heuristic knowledge for Go-Explore though? It seems like a hybrid approach to me.

[–]oldmonk90 -1 points0 points  (0 children)

Good work, can't wait until the AI starts beating speed records on 3d games.

[–]crespo_modesto -4 points-3 points  (0 children)

AI solved dysentery? wow! haha /s

sorry