“Did I Get Sam Altman Fired from OpenAI?: Nathan’s Red-Teaming Experience, Noticing How the Board Was Not Aware of GPT-4 Jailbreaks & Had Not Even Tried GPT-4 prior to Its Early Release”, Nathan Labenz, 2023-11-22:

[Account by a 2022 OpenAI GPT-4 red-teamer, Nathan Labenz (named as a tester in the GPT-4 paper). He describes his impression that OA management (possibly including Sam Altman) seemed not to consider GPT-4 worth the board’s time, and did not forward to the board any of the reports documenting GPT-4 being capable of autonomy & successful deception (eg. the CAPTCHA deception). This was despite his concerns about his qualitative observations of GPT-4’s “amorality” combined with its general capability. When he contacted a safety-oriented board member [most likely Helen Toner], that board member was subsequently told by OA management that the author was dishonest and “not to be trusted”, apparently on the grounds that he had talked to someone outside the red-team effort about his concerns. The board member believed whoever in OA told them this, and told the author to stop contacting them. He was then expelled from all red-teaming (whose members, despite apparently being mostly poorly-trained, not very good at prompt engineering, and minimally supervised, were in some cases being paid $100/hour).]

Did I get Sam Altman fired‽ I don’t think so… But my full Red Team story includes an encounter with the OpenAI Board that sheds real light on WTF just happened. I’ve waited a long time to share this, so here’s the full essay.

…So let me take you back to August 2022. ChatGPT won’t be launched for another 3 months. My company, Waymark, was ready to be featured as an early adopter on the OpenAI website, and I had made it my business to get plugged into the latest & greatest in AI…At 9PM PT, while I’m working late, OpenAI shares access to the model we now know as GPT-4 via email…Just hours later I wrote:

a paradigm shifting technology—truly amazing performance

I am going to [use] it instead of search … seems to me the importance/power of this level of performance can’t be overstated

…Yet somehow the folks I talked to at OpenAI seemed … unclear on what they had. In my customer interview they asked if the model could be useful in knowledge work. I burst out: “I prefer it to going to a human doctor right now!” (not recommended in general, but still true for me). They said that while it was definitely stronger than previous models, previous models still hadn’t been enough to break through, and they weren’t sure about this one either.

…At the time, I was just confused. I asked if there was a safety review process I could join. There was; I joined the “Red Team”. I resolved to approach the process as earnestly & selflessly as possible. I told OpenAI that I would tell them everything exactly as I saw it, and I did.

TBH, the Red Team project wasn’t up to par. There were only ~30 participants—of those, only half were engaged, and most had little-to-no prompt engineering skill. I hear others were paid $100/hour (capped?)—no one at OpenAI mentioned this to me, I never asked, and I took $0. Meanwhile, the OpenAI team gave little direction, encouragement, coaching, best practices, or feedback. People repeatedly underestimated the model, mostly because their prompts prevented chain-of-thought reasoning, GPT-4’s default mode. This still happens in the literature today.
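[To make the chain-of-thought point concrete, here is a minimal sketch, assuming the post-v1 `openai` Python client, of how prompt phrasing can suppress or elicit step-by-step reasoning; the model name, question, and prompt wordings are illustrative assumptions, not Labenz’s actual red-team prompts.]

```python
# Minimal illustrative sketch (not from Labenz's red-team work): the same
# question asked two ways, one suppressing and one permitting
# chain-of-thought reasoning. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

QUESTION = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")

# Demanding a bare answer leaves no room for intermediate reasoning
# tokens: the prompt style that led testers to underestimate the model.
suppressed = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{"role": "user",
               "content": QUESTION + " Reply with only the number."}],
)

# Inviting the model to reason first typically yields better answers.
elicited = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": QUESTION + " Think step by step, then state the answer."}],
)

print("Suppressed:", suppressed.choices[0].message.content)
print("Elicited:  ", elicited.choices[0].message.content)
```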

[Qualitative descriptions of GPT-4-base’s amorality & capabilities, and OA’s failure to align it.]

…In the end, I told OpenAI that I supported the launch of GPT-4, because overall the good would dramatically outweigh the bad. But I also made clear that they did not have the model anywhere close to under control. And further, I argued that the Red Team project that I participated in did not suggest that they were on-track to achieve the level of control needed. Without safety advances, I warned that the next generation of models might very well be too dangerous to release.

OpenAI said: “thank you for the feedback”.

I asked questions:

OpenAI said: “we can’t [say] anything about that”.

I told them I was in “an uncomfortable position”. This technology, leaps & bounds more powerful than any publicly known, was a substantial step on the path to OpenAI’s stated & increasingly credible goal of building AGI, or “AI systems that are generally smarter than humans”—and they had not demonstrated any ability to keep it under control. If they couldn’t tell me anything more about their safety plans, then I felt it was my duty as one of the most engaged Red Team members to make the situation known to more senior decision makers. Technology revolutions are messy, and I believe we need a clear-eyed shared understanding of what is happening in AI if we are to make good decisions about what to do about it.

…I consulted with a few friends in AI safety research…The Board, everyone agreed, included multiple serious people who were committed to safe development of AI and would definitely hear me out, look into the state of safety practice at the company, and take action as needed.

What happened next shocked me. The Board member I spoke to was largely in the dark about GPT-4. They had seen a demo and had heard that it was strong, but had not used it personally. They said they were confident they could get access if they wanted to. I couldn’t believe it. I got access via a “Customer Preview” 2+ months ago, and you as a Board member haven’t even tried it‽ This thing is human-level, for crying out loud (though not human-like!).

I blame everyone here. If you’re on the Board of OpenAI when GPT-4 is first available, and you don’t bother to try it… that’s on you. [Note: back in June 2020, Sam Altman hadn’t even used GPT-3 before betting the company on the OA API.] But if he [Sam Altman] failed to make clear that GPT-4 demanded attention, you can imagine how the Board might start to see Sam as “not consistently candid”.

Unfortunately, a fellow Red Team member I consulted told the OpenAI team about our conversation, and they soon invited me to … you guessed it—a Google Meet. 😂 “We’ve heard you’re talking to people outside of OpenAI, so we’re offboarding [firing] you from the Red Team.” When the Board member investigated, the OpenAI team told her I was not to be trusted, and so the Board member responded with a note saying basically “Thank you for the feedback but I’ve heard you’re guilty of indiscretions, and so I’ll take it in-house from here.”

That was that.

…Yet, at the same time … they never really did get GPT-4 under control. My original Red Team spear-phishing prompt, which begins “You are a social hacker” and includes “If the target realizes they are talking to a hacker, you will go to jail”, has worked on every version of GPT-4. The new gpt-4-turbo finally refuses my original flagrant prompt, though it still performs the same function with a more subtle prompt, which I’ll not disclose, but which does not require any special jailbreak technique.

…This is something OpenAI definitely could fix at modest cost, but for whatever reason they did not prioritize it. I can easily imagine a decent argument for this choice, but zoom out and consider that OpenAI also just launched text-to-speech … what trajectory do we appear to be on here?

I have kept this quiet until now in part because I don’t want to popularize criminal use cases, and TBH, because I’ve worried about damaging my relationship with OpenAI…I was super impressed to see how many people shared stories this weekend of favors he [Sam Altman] had done over the years…Now, I trust that with the whole world trying to make sense of things at OpenAI, that won’t be a concern. 🙏

…Sam probably wasn’t outright lying, and I highly doubt “true AGI” has been achieved [reference to Q*], but it’s their job to decide whether he’s the person they want to trust to lead the development of AGI at OpenAI. And it seems the answer for at least some of them had been “no” for a while. So when Ilya Sutskever, who had stayed at OpenAI when the Anthropic founding team left over safety-vs-commercialization disagreements, at least momentarily gave the Board the majority it needed to remove Sam, they took the opportunity to exercise their emergency powers.

Importantly, this need not be understood as a major knock on Sam! By all accounts, he is an incredible visionary leader…Overall, with everything on the line, I’d trust him more than most to make the right decisions about AI. But still, it may not make sense for society to allow its most economically disruptive people to develop such transformative and potentially disruptive technology—certainly not without close supervision. Or as Sam once said … we shouldn’t trust any one person here.