“Introducing Superalignment”, 2023-07-05 (; backlinks):
[described as “a Manhattan Project for alignment”] We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within 4 years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort. We’re looking for excellent ML researchers and engineers to join us.
…Currently, we don’t have a solution for steering or controlling a potentially superintelligent AI, and preventing it from going rogue. Our current techniques for aligning AI, such as reinforcement learning from human feedback, rely on humans’ ability to supervise AI. But humans won’t be able to reliably supervise AI systems much smarter than us. Other assumptions could also break down in the future, like favorable generalization properties during deployment or our models’ inability to successfully detect and undermine supervision during training, and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs.
…Our approach: Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.
To align the first automated alignment researcher, we will need to (1) develop a scalable training method, (2) validate the resulting model, and (3) stress test our entire alignment pipeline:
To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems (scalable oversight).
In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise (generalization).
To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability).
Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).
We expect our research priorities will evolve substantially as we learn more about the problem and we’ll likely add entirely new research areas. We are planning to share more on our roadmap in the future.
…The new team: We are assembling a team of top machine learning researchers and engineers to work on this problem.
We are dedicating 20% of the compute we’ve secured to date over the next 4 years to solving the problem of superintelligence alignment. Our chief basic research bet is our new Superalignment team, but getting this right is critical to achieve our mission and we expect many teams to contribute, from developing new methods to scaling them up to deployment.