Silicon Dragon
@sdrinf
24 Feb 2023
Lesser-known #ChatGPT trick: ask it to assign truthiness floats to its responses, to bias the model toward metacognition. See below for with & without.
Feb 24, 2023 · 11:44 AM UTC
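A minimal sketch of how the trick above might be wired up, not the exact prompt from the tweet (the tweet does not give one). The instruction wording, the `[truthiness: X]` tag format, and the helper names `with_truthiness` and `extract_scores` are all assumptions made for illustration; no model API is called here.

```python
import re
from typing import List

# Hypothetical instruction wording; the original tweet does not give the exact prompt.
TRUTHINESS_SUFFIX = (
    "After each statement in your answer, append a truthiness score "
    "in the form [truthiness: X], where X is a float between 0.0 and 1.0."
)

# Matches tags such as "[truthiness: 0.85]"; the tag format is an assumption.
_SCORE = re.compile(r"\[truthiness:\s*([01](?:\.\d+)?)\]")


def with_truthiness(question: str) -> str:
    """Return the question with the truthiness instruction appended."""
    return f"{question}\n\n{TRUTHINESS_SUFFIX}"


def extract_scores(reply: str) -> List[float]:
    """Pull every truthiness float out of a reply, dropping values outside [0, 1]."""
    scores = [float(m) for m in _SCORE.findall(reply)]
    return [s for s in scores if 0.0 <= s <= 1.0]
```

The "without" case is just the bare question; the "with" case is `with_truthiness(question)`. Sending both to the model and running `extract_scores` on the second reply would give the per-statement floats to compare against the unannotated answer.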
Silicon Dragon
@sdrinf
24 Feb 2023
Strongly suspect the model can internally reason about the truthiness of its answers, and was just not rewarded for truth during RLHF, in the name of maintaining the noble lies society tells itself.