“Adversarial Policies Beat Superhuman Go AIs”, Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell (2022-11-01):

[talk, summary, background] We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings.

Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human experts can implement it without algorithmic assistance to consistently beat superhuman AIs.

The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack.

Our results demonstrate that even superhuman AI systems may harbor surprising failure modes.

Example games are available at https://goattack.far.ai/.
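
[The recipe here is conceptually simple: freeze the victim, treat it as a fixed part of the environment, and train the adversary with ordinary RL to maximize its win rate. The sketch below is a toy illustration of that framing only, not the paper's actual setup (which, roughly, trains an AlphaZero-style adversary with search against a frozen KataGo network): it fits a REINFORCE adversary against a fixed, biased rock-paper-scissors victim, and every name in it is my own invention:]

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "victim": a fixed stochastic policy over rock/paper/scissors,
# biased toward rock. In the paper the victim is KataGo; here it is
# simply absorbed into the environment.
victim_probs = np.array([0.5, 0.3, 0.2])

# payoff[a_adv, a_vic]: +1 adversary win, 0 draw, -1 loss.
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

theta = np.zeros(3)  # adversary logits (the only trainable parameters)
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)            # adversary move
    v = rng.choice(3, p=victim_probs)  # frozen victim move
    r = payoff[a, v]
    # REINFORCE for a softmax bandit: grad log pi(a) = onehot(a) - pi.
    grad = -pi
    grad[a] += 1.0
    theta += lr * r * grad

print("adversary policy (rock, paper, scissors):", softmax(theta).round(3))
# Converges toward always playing paper, the best response to a
# rock-heavy victim: the adversary exploits this particular opponent
# rather than learning to play the game well in any general sense.
```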


[Yet another discovery driven by abundant compute:]

Adam Gleave: “…I was very unsure going in how easy they’d be to find… In fact, although the method we used is fairly simple, actually getting everything to work was non-trivial. There was one point after we’d patched the first (rather degenerate) pass-attack that the team was doubting whether our method would be able to beat the now stronger KataGo victim.

We were considering cancelling the training run, but decided to leave it going given we had some idle GPUs in the cluster. A few days later there was a phase shift in the win rate of the adversary: it had stumbled across some strategy that worked and finally was learning.”