“‘AI Safety’ Tag”, 2019-09-08:
Bibliography for tag reinforcement-learning/safe, most recent first: 9 related tags, 258 annotations, & 155 links (parent).
- See Also
- Gwern
- “What Do You Do After ‘Winning’ an AI Arms Race?”, 2024
- “What Is an ‘AI Warning Shot’?”, 2024
- “The Neural Net Tank Urban Legend”, 2011
- “It Looks Like You’re Trying To Take Over The World”, 2022
- “Surprisingly Turing-Complete”, 2012
- “The Scaling Hypothesis”, 2020
- “Evolution As Backstop for Reinforcement Learning”, 2018
- “Complexity No Bar to AI”, 2014
- “Why Tool AIs Want to Be Agent AIs”, 2016
- “AI Risk Demos”, 2016
- Links
- “Memorandum on Advancing the United States’ Leadership in Artificial Intelligence”, 2024
- “Machines of Loving Grace: How AI Could Transform the World for the Better”, 2024
- “Strategic Insights from Simulation Gaming of AI Race Dynamics”, et al 2024
- “Towards a Law of Iterated Expectations for Heuristic Estimators”, et al 2024
- “Language Models Learn to Mislead Humans via RLHF”, et al 2024
- “OpenAI Co-Founder Sutskever’s New Safety-Focused AI Startup SSI Raises $1 Billion”, et al 2024
- “Motor Physics: Safety Implications of Geared Motors”, 2024
- “Is Xi Jinping an AI Doomer? China’s Elite Is Split over Artificial Intelligence”, 2024
- “Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?”, et al 2024
- “Resolution of the Central Committee of the Communist Party of China on Further Deepening Reform Comprehensively to Advance Chinese Modernization § Pg58”, China 2024 (page 58)
- “On Scalable Oversight With Weak LLMs Judging Strong LLMs”, et al 2024
- “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs”, et al 2024
- “Ilya Sutskever Has a New Plan for Safe Superintelligence: OpenAI’s Co-Founder Discloses His Plans to Continue His Work at a New Research Lab Focused on Artificial General Intelligence”, 2024
- “Super(ficial)-Alignment: Strong Models May Deceive Weak Models in Weak-To-Strong Generalization”, et al 2024
- “Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models”, et al 2024
- “AI Sandbagging: Language Models Can Strategically Underperform on Evaluations”, et al 2024
- “Safety Alignment Should Be Made More Than Just a Few Tokens Deep”, et al 2024
- “I Wish I Knew How to Force Quit You”, 2024
- “OpenAI Board Forms Safety and Security Committee: This New Committee Is Responsible for Making Recommendations on Critical Safety and Security Decisions for All OpenAI Projects; Recommendations in 90 Days”, OpenAI 2024
- “OpenAI Begins Training next AI Model As It Battles Safety Concerns: Executive Appears to Backtrack on Start-Up’s Vision of Building ‘Superintelligence’ After Exits from ‘Superalignment’ Team”, 2024
- janleike @ “2024-05-28”
- “OpenAI Promised 20% of Its Computing Power to Combat the Most Dangerous Kind of AI—But Never Delivered, Sources Say”, 2024
- “AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes on in Artificial Neural Networks Work Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”, 2024
- DavidSKrueger @ “2024-05-19”
- “Earnings Call: Tesla Discusses Q1 2024 Challenges and AI Expansion”, 2024
- “SOPHON: Non-Fine-Tunable Learning to Restrain Task Transferability For Pre-Trained Models”, et al 2024
- “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”, et al 2024
- “LLM Evaluators Recognize and Favor Their Own Generations”, et al 2024
- “Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression”, et al 2024
- “When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback”, et al 2024
- “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, et al 2024
- “Thousands of AI Authors on the Future of AI”, et al 2024
- “Using Dictionary Learning Features As Classifiers”
- “Exploiting Novel GPT-4 APIs”, et al 2023
- “Comparison of Waymo Rider-Only Crash Data to Human Benchmarks at 7.1 Million Miles”, et al 2023
- “Challenges With Unsupervised LLM Knowledge Discovery”, et al 2023
- “Politics and the Future”, 2023
- “Helping or Herding? Reward Model Ensembles Mitigate but Do Not Eliminate Reward Hacking”, et al 2023
- “The Inside Story of Microsoft’s Partnership With OpenAI: The Companies Had Honed a Protocol for Releasing Artificial Intelligence Ambitiously but Safely. Then OpenAI’s Board Exploded All Their Carefully Laid Plans”, 2023
- “How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”, 2023
- “Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching”, et al 2023
- “Did I Get Sam Altman Fired from OpenAI?: Nathan’s Red-Teaming Experience, Noticing How the Board Was Not Aware of GPT-4 Jailbreaks & Had Not Even Tried GPT-4 prior to Its Early Release”, 2023
- “Did I Get Sam Altman Fired from OpenAI? § GPT-4-Base”, 2023
- “Inside the Chaos at OpenAI: Sam Altman’s Weekend of Shock and Drama Began a Year Ago, With the Release of ChatGPT”, 2023
- “OpenAI Announces Leadership Transition”, et al 2023
- “On Measuring Faithfulness or Self-Consistency of Natural Language Explanations”, 2023
- “In-Context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering”, et al 2023
- “Removing RLHF Protections in GPT-4 via Fine-Tuning”, et al 2023
- “Large Language Models Can Strategically Deceive Their Users When Put Under Pressure”, et al 2023
- “Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation”, et al 2023
- “Augmenting Large Language Models With Chemistry Tools”, et al 2023
- “Preventing Language Models From Hiding Their Reasoning”, 2023
- “Will Releasing the Weights of Large Language Models Grant Widespread Access to Pandemic Agents?”, et al 2023
- “Specific versus General Principles for Constitutional AI”, et al 2023
- “Goodhart’s Law in Reinforcement Learning”, et al 2023
- “Let Models Speak Ciphers: Multiagent Debate through Embeddings”, et al 2023
- “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”, et al 2023
- “Representation Engineering: A Top-Down Approach to AI Transparency”, et al 2023
- “Responsibility & Safety: Our Approach”, Deep2023
- “STARC: A General Framework For Quantifying Differences Between Reward Functions”, et al 2023
- “How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions”, et al 2023
- “What If the Robots Were Very Nice While They Took Over the World?”, 2023
- “Taken out of Context: On Measuring Situational Awareness in LLMs”, et al 2023
- “AI Deception: A Survey of Examples, Risks, and Potential Solutions”, et al 2023
- “Simple Synthetic Data Reduces Sycophancy in Large Language Models”, et al 2023
- “Does Sam Altman Know What He’s Creating? The OpenAI CEO’s Ambitious, Ingenious, Terrifying Quest to Create a New Form of Intelligence”, 2023
- “Question Decomposition Improves the Faithfulness of Model-Generated Reasoning”, et al 2023
- “Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models”, 2023
- “Introducing Superalignment”, 2023
- “Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, 2023
- “Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, 2023
- “Can Large Language Models Democratize Access to Dual-Use Biotechnology?”, et al 2023
- “Survival Instinct in Offline Reinforcement Learning”, et al 2023
- “Thought Cloning: Learning to Think While Acting by Imitating Human Thinking”, 2023
- “The Challenge of Advanced Cyberwar and the Place of Cyberpeace”, 2023
- “Incentivizing Honest Performative Predictions With Proper Scoring Rules”, et al 2023
- “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, 2023
- “A Radical Plan to Make AI Good, Not Evil”, 2023
- “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-Of-Thought Prompting”, et al 2023
- “Mitigating Lies in Vision-Language Models”, et al 2023
- “Fundamental Limitations of Alignment in Large Language Models”, et al 2023
- “Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: Today’s Hearing on AI Covered AI Regulation and Challenges, and the Infamous Open Letter, Which Nearly Everyone in the Room Thought Was Unwise”, 2023
- “In AI Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”, 2023
- “8 Things to Know about Large Language Models”, 2023
- “Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-Founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”, 2023
- “The OpenAI CEO Disagrees With the Forecast That AI Will Kill Us All: An Artificial Intelligence Twitter Beef, Explained”, 2023
- “As AI Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and AI Experts Said”, 2023
- “Pretraining Language Models With Human Preferences”, et al 2023
- “Conditioning Predictive Models: Risks and Strategies”, et al 2023
- “Tracr: Compiled Transformers As a Laboratory for Interpretability”, et al 2023
- “Specification Gaming Examples in AI”
- “Discovering Language Model Behaviors With Model-Written Evaluations”, et al 2022
- “Discovering Latent Knowledge in Language Models Without Supervision”, et al 2022
- “Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula”, et al 2022
- “Interpreting Neural Networks through the Polytope Lens”, et al 2022
- “Mysteries of Mode Collapse § Inescapable Wedding Parties”, 2022
- “Measuring Progress on Scalable Oversight for Large Language Models”, et al 2022
- “Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”, 2022
- “Broken Neural Scaling Laws”, et al 2022
- “Scaling Laws for Reward Model Overoptimization”, et al 2022
- “Defining and Characterizing Reward Hacking”, et al 2022
- “The Alignment Problem from a Deep Learning Perspective”, 2022
- “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, et al 2022
- “Modeling Transformative AI Risks (MTAIR) Project—Summary Report”, et al 2022
- “Researching Alignment Research: Unsupervised Analysis”, et al 2022
- “Ethan Caballero on Private Scaling Progress”, 2022
- “DeepMind: The Podcast—Excerpts on AGI”, 2022
- “Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”, et al 2022
- “Predictability and Surprise in Large Generative Models”, et al 2022
- “Uncalibrated Models Can Improve Human-AI Collaboration”, et al 2022
- “DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-To-Image Generative Transformers”, et al 2022
- “Safe Deep RL in 3D Environments Using Human Feedback”, et al 2022
- “LaMDA: Language Models for Dialog Applications”, et al 2022
- “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, et al 2022
- “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, et al 2021
- “A General Language Assistant As a Laboratory for Alignment”, et al 2021
- “What Would Jiminy Cricket Do? Towards Agents That Behave Morally”, et al 2021
- “Can Machines Learn Morality? The Delphi Experiment”, et al 2021
- “SafetyNet: Safe Planning for Real-World Self-Driving Vehicles Using Machine-Learned Policies”, et al 2021
- “Unsolved Problems in ML Safety”, et al 2021
- “An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions”, et al 2021
- “On the Opportunities and Risks of Foundation Models”, et al 2021
- “Evaluating Large Language Models Trained on Code”, et al 2021
- “Randomness In Neural Network Training: Characterizing The Impact of Tooling”, et al 2021
- “Goal Misgeneralization in Deep Reinforcement Learning”, et al 2021
- “Anthropic Raises $124 Million to Build More Reliable, General AI Systems”, 2021
- “Artificial Intelligence in China’s Revolution in Military Affairs”, 2021
- “Reward Is Enough”, et al 2021
- “Intelligence and Unambitiousness Using Algorithmic Information Theory”, et al 2021
- “AI Dungeon Public Disclosure Vulnerability Report—GraphQL Unpublished Adventure Data Leak”, AetherDevSecOps 2021
- “Universal Off-Policy Evaluation”, et al 2021
- “Multitasking Inhibits Semantic Drift”, et al 2021
- “Waymo Simulated Driving Behavior in Reconstructed Fatal Crashes within an Autonomous Vehicle Operating Domain”, et al 2021
- “Language Models Have a Moral Dimension”, et al 2021
- “Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”, 2021
- “Agent Incentives: A Causal Perspective”, et al 2021
- “Organizational Update from OpenAI”, OpenAI 2020
- “Emergent Road Rules In Multi-Agent Driving Environments”, et al 2020
- “Underspecification Presents Challenges for Credibility in Modern Machine Learning”, et al 2020
- “Recipes for Safety in Open-Domain Chatbots”, et al 2020
- “Hidden Incentives for Auto-Induced Distributional Shift”, et al 2020
- “The Radicalization Risks of GPT-3 and Advanced Neural Language Models”, 2020
- “Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, 2020
- “ETHICS: Aligning AI With Shared Human Values”, et al 2020
- “Pitfalls of Learning a Reward Function Online”, et al 2020
- “Reward-Rational (implicit) Choice: A Unifying Formalism for Reward Learning”, et al 2020
- “The Incentives That Shape Behavior”, et al 2020
- “2019 AI Alignment Literature Review and Charity Comparison”, 2019
- “Learning Norms from Stories: A Prior for Value Aligned Agents”, et al 2019
- “Optimal Policies Tend to Seek Power”, et al 2019
- “Taxonomy of Real Faults in Deep Learning Systems”, et al 2019
- “Release Strategies and the Social Impacts of Language Models”, et al 2019
- “The Bouncer Problem: Challenges to Remote Explainability”, 2019
- “Scaling Data-Driven Robotics With Reward Sketching and Batch Reinforcement Learning”, et al 2019
- “Fine-Tuning GPT-2 from Human Preferences § Bugs Can Optimize for Bad Behavior”, et al 2019
- “Designing Agent Incentives to Avoid Reward Tampering”, et al 2019
- “Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective”, et al 2019
- “Characterizing Attacks on Deep Reinforcement Learning”, et al 2019
- “Categorizing Wireheading in Partially Embedded Agents”, et al 2019
- “Risks from Learned Optimization in Advanced Machine Learning Systems”, et al 2019
- “GROVER: Defending Against Neural Fake News”, et al 2019
- “AI-GAs: AI-Generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence”, 2019
- “Challenges of Real-World Reinforcement Learning”, Dulac-Arnold et al 2019
- “DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”, 2019
- “Forecasting Transformative AI: An Expert Survey”, et al 2019
- “Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”, 2019
- “Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures”, et al 2018
- “There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, 2018
- “Better Safe Than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning”, et al 2018
- “The Alignment Problem for Bayesian History-Based Reinforcement Learners”, 2018
- “Adaptive Mechanism Design: Learning to Promote Cooperation”, et al 2018
- “Visceral Machines: Risk-Aversion in Reinforcement Learning With Intrinsic Physiological Rewards”, 2018
- “Incomplete Contracting and AI Alignment”, Hadfield-Menell & Hadfield 2018
- “Programmatically Interpretable Reinforcement Learning”, et al 2018
- “Categorizing Variants of Goodhart’s Law”, 2018
- “The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”, et al 2018
- “Machine Theory of Mind”, et al 2018
- “Safe Exploration in Continuous Action Spaces”, et al 2018
- “CycleGAN, a Master of Steganography”, et al 2017
- “AI Safety Gridworlds”, et al 2017
- “There’s No Fire Alarm for Artificial General Intelligence”, 2017
- “Safe Reinforcement Learning via Shielding”, et al 2017
- “CAN: Creative Adversarial Networks, Generating ‘Art’ by Learning About Styles and Deviating from Style Norms”, et al 2017
- “DeepXplore: Automated Whitebox Testing of Deep Learning Systems”, et al 2017
- “On the Impossibility of Supersized Machines”, et al 2017
- “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks”, et al 2017
- “The Off-Switch Game”, Hadfield-Menell et al 2016
- “Combating Reinforcement Learning’s Sisyphean Curse With Intrinsic Fear”, et al 2016
- “Concrete Problems in AI Safety”, et al 2016
- “My Path to OpenAI”, 2016
- “Machine Intelligence, Part 2”, 2015
- “Machine Intelligence, Part 1”, 2015
- gdb @ “2014-05-18”
- “Intelligence Explosion Microeconomics”, 2013
- “The Whispering Earring”, 2012
- “Advantages of Artificial Intelligences, Uploads, and Digital Minds”, 2012
- “Ontological Crises in Artificial Agents’ Value Systems”, 2011
- “The Normalization of Deviance in Healthcare Delivery”, 2010
- “Halloween Nightmare Scenario, Early 2020’s”, 2009
- “Funding Safe AGI”, 2009
- “The Basic AI Drives”, 2008
- “Starfish § Bulrushes”, 1999
- “Superhumanism: According to Hans Moravec § On the Inevitability & Desirability of Human Extinction”, 1995
- “Profile of Claude Shannon”, 1987
- “Afterword to Vernor Vinge’s Novel, True Names”, 1984
- “Meet Shakey: the First Electronic Person—The Fascinating and Fearsome Reality of a Machine With a Mind of Its Own”, 1970
- “Some Moral and Technical Consequences of Automation: As Machines Learn They May Develop Unforeseen Strategies at Rates That Baffle Their Programmers”, 1960
- “Intelligent Machinery, A Heretical Theory”, 1951
- “Brian Christian on the Alignment Problem”
- “Fiction Relevant to AI Futurism”
- “The Ethics of Reward Shaping”
- “Delayed Impact of Fair Machine Learning [Blog]”
- “Challenges of Real-World Reinforcement Learning [Blog]”
- “Janus”
- “Safety-First AI for Autonomous Data Center Cooling and Industrial Control”
- “Specification Gaming Examples in AI—Master List”
- “Are You Really in a Race? The Cautionary Tales of Szilard and Ellsberg”
- “inverse-scaling/prize: A Prize for Finding Tasks That Cause Large Language Models to Show Inverse Scaling”
- “Jan Leike”
- “Aurora’s Approach to Development”
- “Homepage of Paul F. Christiano”, 2024
- “Rasmussen and Practical Drift: Drift towards Danger and the Normalization of Deviance”, 2017
- “The Checklist: What Succeeding at AI Safety Will Involve”
- “Safe Superintelligence Inc.”
- “Situational Awareness and Out-Of-Context Reasoning § When Will the Situational Awareness Benchmark Be Saturated?”, 2024
- “Paradigms of AI Alignment: Components and Enablers”
- “Understand—A Novelette by Ted Chiang”
- “Slow Tuesday Night”, 2024
- “Threats From AI: Easy Recipes for Bioweapons Are New Global Security Concern”
- “Carl Shulman #2: AI Takeover, Bio & Cyber Attacks, Detecting Deception, & Humanity’s Far Future”
- “AI Takeoff”
- “That Alien Message”, 2024
- “AXRP Episode 1—Adversarial Policies With Adam Gleave”
- “Preventing Language Models from Hiding Their Reasoning”
- “2021 AI Alignment Literature Review and Charity Comparison”
- “When Your AIs Deceive You: Challenges With Partial Observability in RLHF”
- “Risks from Learned Optimization: Introduction”
- “AI Takeoff Story: a Continuation of Progress by Other Means”
- “Reward Hacking Behavior Can Generalize across Tasks”
- “Security Mindset: Lessons from 20+ Years of Software Security Failures Relevant to AGI Alignment”
- “Research Update: Towards a Law of Iterated Expectations for Heuristic Estimators”
- “A Gym Gridworld Environment for the Treacherous Turn”
- “Model Mis-Specification and Inverse Reinforcement Learning”
- “Interview With Robert Kralisch on Simulators”
- “Survey: How Do Elite Chinese Students Feel About the Risks of AI?”
- “Optimality Is the Tiger, and Agents Are Its Teeth”
- “[AN #114]: Theory-Inspired Safety Solutions for Powerful Bayesian RL Agents”
- “2020 AI Alignment Literature Review and Charity Comparison”
- “Designing Agent Incentives to Avoid Reward Tampering”
- “AGI Ruin: A List of Lethalities”
- “Steganography and the CycleGAN—Alignment Failure Case Study”
- “[AN #161]: Creating Generalizable Reward Functions for Multiple Tasks by Learning a Model of Functional Similarity”
- “Steganography in Chain-Of-Thought Reasoning”
- “The Rise of A.I. Fighter Pilots”
- “When Self-Driving Cars Can’t Help Themselves, Who Takes the Wheel?”
- “The Robot Surgeon Will See You Now”
- “Welcome to Simulation City, the Virtual World Where Waymo Tests Its Autonomous Vehicles”
- “When Bots Teach Themselves to Cheat”