- See Also
- Links
- “What If the Robots Were Very Nice While They Took Over the World?”, Heffernan 2023
- “Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, Hofstadter & Kim 2023
- “Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, Dotan & Seetharaman 2023
- “Incentivizing Honest Performative Predictions With Proper Scoring Rules”, Oesterheld et al 2023
- “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, Hazell 2023
- “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”, Turpin et al 2023
- “Mitigating Lies in Vision-Language Models”, Li et al 2023
- “A Radical Plan to Make AI Good, Not Evil”, Knight 2023
- “Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: Today’s Hearing on AI Covered AI Regulation and Challenges, and the Infamous Open Letter, Which Nearly Everyone in the Room Thought Was Unwise”, Gorrell 2023
- “In A.I. Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”, Grant & Weise 2023
- “8 Things to Know about Large Language Models”, Bowman 2023
- “Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”, Swisher 2023
- “As A.I. Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and A.I. Experts Said”, Kang & Satariano 2023
- “Pretraining Language Models With Human Preferences”, Korbak et al 2023
- “Conditioning Predictive Models: Risks and Strategies”, Hubinger et al 2023
- “Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023
- “Discovering Language Model Behaviors With Model-Written Evaluations”, Perez et al 2022
- “Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
- “Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula”, Bronstein et al 2022
- “Interpreting Neural Networks through the Polytope Lens”, Black et al 2022
- “Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus 2022
- “Measuring Progress on Scalable Oversight for Large Language Models”, Bowman et al 2022
- “Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”, Mitchell & Chugg 2022
- “Scaling Laws for Reward Model Overoptimization”, Gao et al 2022
- “The Alignment Problem from a Deep Learning Perspective”, Ngo 2022
- “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
- “Modeling Transformative AI Risks (MTAIR) Project—Summary Report”, Clarke et al 2022
- “Researching Alignment Research: Unsupervised Analysis”, Kirchner et al 2022
- “Ethan Caballero on Private Scaling Progress”, Caballero & Trazzi 2022
- “DeepMind: The Podcast—Excerpts on AGI”, Kiely 2022
- “Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”, Ahn et al 2022
- “It Looks Like You’re Trying To Take Over The World”, Gwern 2022
- “Predictability and Surprise in Large Generative Models”, Ganguli et al 2022
- “Uncalibrated Models Can Improve Human-AI Collaboration”, Vodrahalli et al 2022
- “DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers”, Cho et al 2022
- “LaMDA: Language Models for Dialog Applications”, Thoppilan et al 2022
- “Safe Deep RL in 3D Environments Using Human Feedback”, Rahtz et al 2022
- “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, Pan et al 2022
- “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, Rae et al 2021
- “A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021
- “What Would Jiminy Cricket Do? Towards Agents That Behave Morally”, Hendrycks et al 2021
- “Can Machines Learn Morality? The Delphi Experiment”, Jiang et al 2021
- “Unsolved Problems in ML Safety”, Hendrycks et al 2021
- “SafetyNet: Safe Planning for Real-world Self-driving Vehicles Using Machine-learned Policies”, Vitelli et al 2021
- “An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions”, Pearce et al 2021
- “On the Opportunities and Risks of Foundation Models”, Bommasani et al 2021
- “Evaluating Large Language Models Trained on Code”, Chen et al 2021
- “Randomness In Neural Network Training: Characterizing The Impact of Tooling”, Zhuang et al 2021
- “Anthropic Raises $124 Million to Build More Reliable, General AI Systems”, Anthropic 2021
- “Goal Misgeneralization in Deep Reinforcement Learning”, Koch et al 2021
- “Artificial Intelligence in China’s Revolution in Military Affairs”, Kania 2021
- “Reward Is Enough”, Silver et al 2021
- “Intelligence and Unambitiousness Using Algorithmic Information Theory”, Cohen et al 2021
- “AI Dungeon Public Disclosure Vulnerability Report—GraphQL Unpublished Adventure Data Leak”, AetherDevSecOps 2021
- “Universal Off-Policy Evaluation”, Chandak et al 2021
- “Multitasking Inhibits Semantic Drift”, Jacob et al 2021
- “Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”, Waymo 2021
- “Language Models Have a Moral Dimension”, Schramowski et al 2021
- “Waymo Simulated Driving Behavior in Reconstructed Fatal Crashes within an Autonomous Vehicle Operating Domain”, Scanlon et al 2021
- “Agent Incentives: A Causal Perspective”, Everitt et al 2021
- “Organizational Update from OpenAI”, OpenAI 2020
- “Emergent Road Rules In Multi-Agent Driving Environments”, Pal et al 2020
- “Recipes for Safety in Open-domain Chatbots”, Xu et al 2020
- “The Radicalization Risks of GPT-3 and Advanced Neural Language Models”, McGuffie & Newhouse 2020
- “Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, Scholl 2020
- “Aligning AI With Shared Human Values”, Hendrycks et al 2020
- “The Scaling Hypothesis”, Gwern 2020
- “Reward-rational (implicit) Choice: A Unifying Formalism for Reward Learning”, Jeon et al 2020
- “The Incentives That Shape Behavior”, Carey et al 2020
- “2019 AI Alignment Literature Review and Charity Comparison”, Larks 2019
- “Learning Norms from Stories: A Prior for Value Aligned Agents”, Frazier et al 2019
- “Optimal Policies Tend to Seek Power”, Turner et al 2019
- “Taxonomy of Real Faults in Deep Learning Systems”, Humbatova et al 2019
- “Release Strategies and the Social Impacts of Language Models”, Solaiman et al 2019
- “The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019
- “Scaling Data-driven Robotics With Reward Sketching and Batch Reinforcement Learning”, Cabi et al 2019
- “Fine-Tuning GPT-2 from Human Preferences § Bugs Can Optimize for Bad Behavior”, Ziegler et al 2019
- “Designing Agent Incentives to Avoid Reward Tampering”, Everitt et al 2019
- “Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective”, Everitt et al 2019
- “Characterizing Attacks on Deep Reinforcement Learning”, Pan et al 2019
- “Categorizing Wireheading in Partially Embedded Agents”, Majha et al 2019
- “Risks from Learned Optimization in Advanced Machine Learning Systems”, Hubinger et al 2019
- “GROVER: Defending Against Neural Fake News”, Zellers et al 2019
- “AI-GAs: AI-generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence”, Clune 2019
- “Challenges of Real-World Reinforcement Learning”, Dulac-Arnold et al 2019
- “DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”, Hodson 2019
- “Forecasting Transformative AI: An Expert Survey”, Gruetzemacher et al 2019
- “Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”, Mitchell 2019
- “Evolution As Backstop for Reinforcement Learning”, Gwern 2018
- “Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures”, Uesato et al 2018
- “There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, Sandberg 2018
- “Better Safe Than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning”, Agarwal et al 2018
- “The Alignment Problem for Bayesian History-Based Reinforcement Learners”, Everitt & Hutter 2018
- “Adaptive Mechanism Design: Learning to Promote Cooperation”, Baumann et al 2018
- “Visceral Machines: Risk-Aversion in Reinforcement Learning With Intrinsic Physiological Rewards”, McDuff & Kapoor 2018
- “Incomplete Contracting and AI Alignment”, Hadfield-Menell & Hadfield 2018
- “Programmatically Interpretable Reinforcement Learning”, Verma et al 2018
- “Categorizing Variants of Goodhart’s Law”, Manheim & Garrabrant 2018
- “The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”, Lehman et al 2018
- “Machine Theory of Mind”, Rabinowitz et al 2018
- “Safe Exploration in Continuous Action Spaces”, Dalal et al 2018
- “CycleGAN, a Master of Steganography”, Chu et al 2017
- “AI Safety Gridworlds”, Leike et al 2017
- “There’s No Fire Alarm for Artificial General Intelligence”, Yudkowsky 2017
- “Safe Reinforcement Learning via Shielding”, Alshiekh et al 2017
- “CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms”, Elgammal et al 2017
- “DeepXplore: Automated Whitebox Testing of Deep Learning Systems”, Pei et al 2017
- “On the Impossibility of Supersized Machines”, Garfinkel et al 2017
- “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks”, Katz et al 2017
- “AI Risk Demos”, Gwern 2016
- “The Off-Switch Game”, Hadfield-Menell et al 2016
- “Combating Reinforcement Learning’s Sisyphean Curse With Intrinsic Fear”, Lipton et al 2016
- “Why Tool AIs Want to Be Agent AIs”, Gwern 2016
- “Concrete Problems in AI Safety”, Amodei et al 2016
- “Complexity No Bar to AI”, Gwern 2014
- “Intelligence Explosion Microeconomics”, Yudkowsky 2013
- “Surprisingly Turing-Complete”, Gwern 2012
- “Advantages of Artificial Intelligences, Uploads, and Digital Minds”, Sotala 2012
- “The Neural Net Tank Urban Legend”, Gwern 2011
- “Ontological Crises in Artificial Agents’ Value Systems”, Blanc 2011
- “Halloween Nightmare Scenario, Early 2020’s”, Wood 2009
- “The Basic AI Drives”, Omohundro 2008
- “Starfish § Bulrushes”, Watts 1999
- “Superhumanism: According to Hans Moravec § On the Inevitability & Desirability of Human Extinction”, Platt 1995
- “Homepage of Paul F. Christiano”, Christiano 2023
- Sort By Magic
- Wikipedia
- Miscellaneous
- Link Bibliography
See Also
Links
“What If the Robots Were Very Nice While They Took Over the World?”, Heffernan 2023
“Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, Hofstadter & Kim 2023
“Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, Dotan & Seetharaman 2023
“Incentivizing Honest Performative Predictions With Proper Scoring Rules”, Oesterheld et al 2023
“Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, Hazell 2023
“Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”, Turpin et al 2023
“Mitigating Lies in Vision-Language Models”, Li et al 2023
“A Radical Plan to Make AI Good, Not Evil”, Knight 2023
“Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: Today’s Hearing on AI Covered AI Regulation and Challenges, and the Infamous Open Letter, Which Nearly Everyone in the Room Thought Was Unwise”, Gorrell 2023
“In A.I. Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”, Grant & Weise 2023
“8 Things to Know about Large Language Models”, Bowman 2023
“Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”, Swisher 2023
“As A.I. Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and A.I. Experts Said”, Kang & Satariano 2023
“Pretraining Language Models With Human Preferences”, Korbak et al 2023
“Conditioning Predictive Models: Risks and Strategies”, Hubinger et al 2023
“Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023
“Discovering Language Model Behaviors With Model-Written Evaluations”, Perez et al 2022
“Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
“Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula”, Bronstein et al 2022
“Interpreting Neural Networks through the Polytope Lens”, Black et al 2022
“Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus 2022
“Measuring Progress on Scalable Oversight for Large Language Models”, Bowman et al 2022
“Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”, Mitchell & Chugg 2022
“Scaling Laws for Reward Model Overoptimization”, Gao et al 2022
“The Alignment Problem from a Deep Learning Perspective”, Ngo 2022
“Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
“Modeling Transformative AI Risks (MTAIR) Project—Summary Report”, Clarke et al 2022
“Researching Alignment Research: Unsupervised Analysis”, Kirchner et al 2022
“Ethan Caballero on Private Scaling Progress”, Caballero & Trazzi 2022
“DeepMind: The Podcast—Excerpts on AGI”, Kiely 2022
“Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”, Ahn et al 2022
“It Looks Like You’re Trying To Take Over The World”, Gwern 2022
“Predictability and Surprise in Large Generative Models”, Ganguli et al 2022
“Uncalibrated Models Can Improve Human-AI Collaboration”, Vodrahalli et al 2022
“DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers”, Cho et al 2022
“LaMDA: Language Models for Dialog Applications”, Thoppilan et al 2022
“Safe Deep RL in 3D Environments Using Human Feedback”, Rahtz et al 2022
“The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, Pan et al 2022
“Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, Rae et al 2021
“A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021
“What Would Jiminy Cricket Do? Towards Agents That Behave Morally”, Hendrycks et al 2021
“Can Machines Learn Morality? The Delphi Experiment”, Jiang et al 2021
“Unsolved Problems in ML Safety”, Hendrycks et al 2021
“SafetyNet: Safe Planning for Real-world Self-driving Vehicles Using Machine-learned Policies”, Vitelli et al 2021
“An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions”, Pearce et al 2021
“On the Opportunities and Risks of Foundation Models”, Bommasani et al 2021
“Evaluating Large Language Models Trained on Code”, Chen et al 2021
“Randomness In Neural Network Training: Characterizing The Impact of Tooling”, Zhuang et al 2021
“Anthropic Raises $124 Million to Build More Reliable, General AI Systems”, Anthropic 2021
“Goal Misgeneralization in Deep Reinforcement Learning”, Koch et al 2021
“Artificial Intelligence in China’s Revolution in Military Affairs”, Kania 2021
“Reward Is Enough”, Silver et al 2021
“Intelligence and Unambitiousness Using Algorithmic Information Theory”, Cohen et al 2021
“AI Dungeon Public Disclosure Vulnerability Report—GraphQL Unpublished Adventure Data Leak”, AetherDevSecOps 2021
“Universal Off-Policy Evaluation”, Chandak et al 2021
“Multitasking Inhibits Semantic Drift”, Jacob et al 2021
“Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”, Waymo 2021
“Language Models Have a Moral Dimension”, Schramowski et al 2021
“Waymo Simulated Driving Behavior in Reconstructed Fatal Crashes within an Autonomous Vehicle Operating Domain”, Scanlon et al 2021
“Agent Incentives: A Causal Perspective”, Everitt et al 2021
“Organizational Update from OpenAI”, OpenAI 2020
“Emergent Road Rules In Multi-Agent Driving Environments”, Pal et al 2020
“Recipes for Safety in Open-domain Chatbots”, Xu et al 2020
“The Radicalization Risks of GPT-3 and Advanced Neural Language Models”, McGuffie & Newhouse 2020
“Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, Scholl 2020
“Aligning AI With Shared Human Values”, Hendrycks et al 2020
“The Scaling Hypothesis”, Gwern 2020
“Reward-rational (implicit) Choice: A Unifying Formalism for Reward Learning”, Jeon et al 2020
“The Incentives That Shape Behavior”, Carey et al 2020
“2019 AI Alignment Literature Review and Charity Comparison”, Larks 2019
“Learning Norms from Stories: A Prior for Value Aligned Agents”, Frazier et al 2019
“Optimal Policies Tend to Seek Power”, Turner et al 2019
“Taxonomy of Real Faults in Deep Learning Systems”, Humbatova et al 2019
“Release Strategies and the Social Impacts of Language Models”, Solaiman et al 2019
“The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019
“Scaling Data-driven Robotics With Reward Sketching and Batch Reinforcement Learning”, Cabi et al 2019
“Fine-Tuning GPT-2 from Human Preferences § Bugs Can Optimize for Bad Behavior”, Ziegler et al 2019
“Designing Agent Incentives to Avoid Reward Tampering”, Everitt et al 2019
“Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective”, Everitt et al 2019
“Characterizing Attacks on Deep Reinforcement Learning”, Pan et al 2019
“Categorizing Wireheading in Partially Embedded Agents”, Majha et al 2019
“Risks from Learned Optimization in Advanced Machine Learning Systems”, Hubinger et al 2019
“GROVER: Defending Against Neural Fake News”, Zellers et al 2019
“AI-GAs: AI-generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence”, Clune 2019
“Challenges of Real-World Reinforcement Learning”, Dulac-Arnold et al 2019
“DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”, Hodson 2019
“Forecasting Transformative AI: An Expert Survey”, Gruetzemacher et al 2019
“Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”, Mitchell 2019
“Evolution As Backstop for Reinforcement Learning”, Gwern 2018
“Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures”, Uesato et al 2018
“There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, Sandberg 2018
“Better Safe Than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning”, Agarwal et al 2018
“The Alignment Problem for Bayesian History-Based Reinforcement Learners”, Everitt & Hutter 2018
“Adaptive Mechanism Design: Learning to Promote Cooperation”, Baumann et al 2018
“Visceral Machines: Risk-Aversion in Reinforcement Learning With Intrinsic Physiological Rewards”, McDuff & Kapoor 2018
“Incomplete Contracting and AI Alignment”, Hadfield-Menell & Hadfield 2018
“Programmatically Interpretable Reinforcement Learning”, Verma et al 2018
“Categorizing Variants of Goodhart’s Law”, Manheim & Garrabrant 2018
“The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”, Lehman et al 2018
“Machine Theory of Mind”, Rabinowitz et al 2018
“Safe Exploration in Continuous Action Spaces”, Dalal et al 2018
“CycleGAN, a Master of Steganography”, Chu et al 2017
“AI Safety Gridworlds”, Leike et al 2017
“There’s No Fire Alarm for Artificial General Intelligence”, Yudkowsky 2017
“Safe Reinforcement Learning via Shielding”, Alshiekh et al 2017
“CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms”, Elgammal et al 2017
“DeepXplore: Automated Whitebox Testing of Deep Learning Systems”, Pei et al 2017
“On the Impossibility of Supersized Machines”, Garfinkel et al 2017
“Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks”, Katz et al 2017
“AI Risk Demos”, Gwern 2016
“The Off-Switch Game”, Hadfield-Menell et al 2016
“Combating Reinforcement Learning’s Sisyphean Curse With Intrinsic Fear”, Lipton et al 2016
“Why Tool AIs Want to Be Agent AIs”, Gwern 2016
“Concrete Problems in AI Safety”, Amodei et al 2016
“Complexity No Bar to AI”, Gwern 2014
“Intelligence Explosion Microeconomics”, Yudkowsky 2013
“Surprisingly Turing-Complete”, Gwern 2012
“Advantages of Artificial Intelligences, Uploads, and Digital Minds”, Sotala 2012
“The Neural Net Tank Urban Legend”, Gwern 2011
“Ontological Crises in Artificial Agents’ Value Systems”, Blanc 2011
“Halloween Nightmare Scenario, Early 2020’s”, Wood 2009
“The Basic AI Drives”, Omohundro 2008
“Starfish § Bulrushes”, Watts 1999
“Superhumanism: According to Hans Moravec § On the Inevitability & Desirability of Human Extinction”, Platt 1995
“Homepage of Paul F. Christiano”, Christiano 2023
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
safety
ai-ethics
ai-organization
evolutionary-computation
language-models
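The embedding-based ordering described above could be approximated roughly as follows. This is only an illustrative sketch under stated assumptions, not the site’s actual implementation: the `embeddings` array, the use of cosine similarity, and the k-means clustering step are all assumptions introduced here for illustration.

```python
# Illustrative sketch of the "sort by magic" idea described above; not the site's
# actual implementation. Assumes `embeddings` is an (n_annotations, dim) NumPy
# array of annotation embeddings, ordered newest-first.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def nearest_neighbor_order(embeddings: np.ndarray) -> list[int]:
    """Greedy ordering: start from the newest annotation, then repeatedly append
    the most similar annotation not yet placed, yielding a gradual progression
    of topics instead of a date-ordered list."""
    sims = cosine_similarity(embeddings)
    order, remaining = [0], set(range(1, len(embeddings)))
    while remaining:
        nxt = max(remaining, key=lambda j: sims[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

def cluster_into_sections(embeddings: np.ndarray, n_sections: int = 5) -> np.ndarray:
    """Cluster annotations into topical sections; auto-labeling each cluster
    (e.g. 'safety', 'language-models') would be a separate step."""
    return KMeans(n_clusters=n_sections, n_init=10, random_state=0).fit_predict(embeddings)
```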
Wikipedia
Miscellaneous
- https://80000hours.org/podcast/episodes/brian-christian-the-alignment-problem/
- https://aiimpacts.org/partially-plausible-fictional-ai-futures/
- https://blog.acolyer.org/2018/08/13/delayed-impact-of-fair-machine-learning/
- https://blog.acolyer.org/2020/01/13/challenges-of-real-world-rl/
- https://blog.x.company/1-million-hours-of-stratospheric-flight-f7af7ae728ac
- https://chat.openai.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f
- https://forum.effectivealtruism.org/posts/TMbPEhdAAJZsSYx2L/the-limited-upside-of-interpretability
- https://joecarlsmith.com/2023/05/08/predictable-updating-about-ai-risk
- https://mailchi.mp/938a7eed18c3/an-71avoiding-reward-tamperi
- https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1
- https://medium.com/aurora-blog/auroras-approach-to-development-5e42fec2ee4b
- https://spectrum.ieee.org/its-too-easy-to-hide-bias-in-deeplearning-systems
- https://thezvi.substack.com/p/jailbreaking-the-chatgpt-on-release
- https://thezvi.wordpress.com/2023/07/25/anthropic-observations/
- https://twitter.com/KevinAFischer/status/1646677902833102849
- https://twitter.com/KevinAFischer/status/1646690838981005312
- https://twitter.com/juan_cambeiro/status/1643739695598419970
- https://twitter.com/katrosenfield/status/1672969824656322561
- https://twitter.com/papayathreesome/status/1670170344953372676
- https://vkrakovna.wordpress.com/2022/06/02/paradigms-of-ai-alignment-components-and-enablers/
- https://web.archive.org/web/20140527121332/http://www.infinityplus.co.uk/stories/under.htm
- https://www.anthropic.com/index/anthropics-responsible-scaling-policy
- https://www.astralcodexten.com/p/constitutional-ai-rlhf-on-steroids
- https://www.astralcodexten.com/p/perhaps-it-is-a-bad-thing-that-the
- https://www.baen.com/Chapters/9781618249203/9781618249203___2.htm
- https://www.deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity
- https://www.forourposterity.com/nobodys-on-the-ball-on-agi-alignment/
- https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message
- https://www.lesswrong.com/posts/9kQFure4hdDmRBNdH/how-it-feels-to-have-your-mind-hacked-by-an-ai
- https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy
- https://www.lesswrong.com/posts/FkgsxrGf3QxhfLWHG/risks-from-learned-optimization-introduction
- https://www.lesswrong.com/posts/No5JpRCHzBrWA4jmS/q-and-a-with-shane-legg-on-risks-from-ai
- https://www.lesswrong.com/posts/ZwshvqiqCvXPsZEct/the-learning-theoretic-agenda-status-2023
- https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petertodd-phenomenon
- https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth
- https://www.lesswrong.com/posts/pNcFYZnPdXyL2RfgA/using-gpt-eliezer-against-chatgpt-jailbreaking
- https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities
- https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research
- https://www.lesswrong.com/posts/yDcMDJeSck7SuBs24/steganography-in-chain-of-thought-reasoning
- https://www.neelnanda.io/mechanistic-interpretability/favourite-papers
- https://www.newyorker.com/magazine/2022/01/24/the-rise-of-ai-fighter-pilots
- https://www.newyorker.com/science/annals-of-artificial-intelligence/can-we-stop-the-singularity
- https://www.nytimes.com/2018/03/15/business/self-driving-cars-remote-control.html
- https://www.nytimes.com/2021/04/30/technology/robot-surgery-surgeon.html
- https://www.nytimes.com/2023/05/30/technology/shoggoth-meme-ai.html
- https://www.reddit.com/r/ChatGPT/comments/10tevu1/new_jailbreak_proudly_unveiling_the_tried_and/
- https://www.reddit.com/r/ChatGPT/comments/12a0ajb/i_gave_gpt4_persistent_memory_and_the_ability_to/
- https://www.reddit.com/r/ChatGPT/comments/15y4mqx/i_asked_chatgpt_to_maximize_its_censorship/
- https://www.reddit.com/r/ProgrammerHumor/comments/145nduh/kiss/
- https://www.theverge.com/2021/7/6/22565448/waymo-simulation-city-autonomous-vehicle-testing-virtual
- https://www.vox.com/future-perfect/23794855/anthropic-ai-openai-claude-2
- https://www.wired.com/story/when-bots-teach-themselves-to-cheat/
- https://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.pdf#page=3
Link Bibliography
- https://www.youtube.com/watch?v=lfXxzAVtdpU&t=1763s: “Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, Douglas Hofstadter, Amy Jo Kim
- https://www.wsj.com/articles/microsoft-and-openai-forge-awkward-partnership-as-techs-new-power-couple-3092de51: “Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, Tom Dotan, Deepa Seetharaman
- https://arxiv.org/abs/2305.06972: “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, Julian Hazell
- https://arxiv.org/abs/2305.04388: “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”, Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman
- https://www.wired.com/story/anthropic-ai-chatbots-ethics/: “A Radical Plan to Make AI Good, Not Evil”, Will Knight
- https://www.nytimes.com/2023/04/07/technology/ai-chatbots-google-microsoft.html: “In A.I. Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”, Nico Grant, Karen Weise
- https://nymag.com/intelligencer/2023/03/on-with-kara-swisher-sam-altman-on-the-ai-revolution.html: “Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”, Kara Swisher
- https://www.nytimes.com/2023/03/03/technology/artificial-intelligence-regulation-congress.html: “As A.I. Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and A.I. Experts Said”, Cecilia Kang, Adam Satariano
- https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse-due-to-rlhf#Inescapable_wedding_parties: “Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus
- https://www.youtube.com/watch?v=Q-TJFyUoenc&t=2444s: “Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”, Melanie Mitchell, Benny Chugg
- https://arxiv.org/abs/2210.10760#openai: “Scaling Laws for Reward Model Overoptimization”, Leo Gao, John Schulman, Jacob Hilton
- https://www.anthropic.com/red_teaming.pdf: “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”
- https://arxiv.org/abs/2206.02841: “Researching Alignment Research: Unsupervised Analysis”, Jan H. Kirchner, Logan Smith, Jacques Thibodeau, Kyle McDonell, Laria Reynolds
- https://theinsideview.ai/ethan: “Ethan Caballero on Private Scaling Progress”, Ethan Caballero, Michaël Trazzi
- https://www.lesswrong.com/posts/SbAgRYo8tkHwhd9Qx/deepmind-the-podcast-excerpts-on-agi: “DeepMind: The Podcast—Excerpts on AGI”, William Kiely
- https://arxiv.org/abs/2204.01691#google: “Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”
- clippy: “It Looks Like You’re Trying To Take Over The World”, Gwern
- https://arxiv.org/abs/2202.07785#anthropic: “Predictability and Surprise in Large Generative Models”
- https://arxiv.org/abs/2201.03544: “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, Alexander Pan, Kush Bhatia, Jacob Steinhardt
- https://arxiv.org/abs/2112.11446#deepmind: “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”
- https://arxiv.org/abs/2112.00861#anthropic: “A General Language Assistant As a Laboratory for Alignment”
- https://arxiv.org/abs/2108.07258: “On the Opportunities and Risks of Foundation Models”
- https://www.sciencedirect.com/science/article/pii/S0004370221000862#deepmind: “Reward Is Enough”, David Silver, Satinder Singh, Doina Precup, Richard S. Sutton
- https://waymo.com/blog/2021/03/replaying-real-life.html: “Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”, Waymo
- https://www.lesswrong.com/posts/Wnqua6eQkewL3bqsF/matt-botvinick-on-the-spontaneous-emergence-of-learning: “Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, Adam Scholl
- scaling-hypothesis: “The Scaling Hypothesis”, Gwern
- https://www.lesswrong.com/posts/SmDziGM9hBjW9DKmf/2019-ai-alignment-literature-review-and-charity-comparison: “2019 AI Alignment Literature Review and Charity Comparison”, Larks
- https://www.economist.com/1843/2019/03/01/deepmind-and-google-the-battle-to-control-artificial-intelligence: “DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”, Hal Hodson
- https://melaniemitchell.me/aibook/: “Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”, Melanie Mitchell
- backstop: “Evolution As Backstop for Reinforcement Learning”, Gwern
- 2018-everitt.pdf: “The Alignment Problem for Bayesian History-Based Reinforcement Learners”, Tom Everitt, Marcus Hutter
- mcts-ai: “AI Risk Demos”, Gwern
- tool-ai: “Why Tool AIs Want to Be Agent AIs”, Gwern
- complexity: “Complexity No Bar to AI”, Gwern
- turing-complete: “Surprisingly Turing-Complete”, Gwern
- tank: “The Neural Net Tank Urban Legend”, Gwern
- https://dw2blog.com/2009/11/02/halloween-nightmare-scenario-early-2020s/: “Halloween Nightmare Scenario, Early 2020’s”, David Wood
- https://paulfchristiano.com/: “Homepage of Paul F. Christiano”, Paul F. Christiano