
[–]karearearea 1 point (1 child)

It might be a mistake - but it also might not be.

I think o1 was trained on ‘effective reasoning steps’ rather than ‘human-understandable reasoning steps’. By that I mean I think many of the training reasoning chains were generated by another AI model: they probably showed it a problem that was hard to solve but easy to verify (like programming or maths), got it to generate a thousand reasoning chains and answers, and checked whether any of them were correct. If one was, they added it to the training data; if not, they repeated the process.
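
To make that loop concrete, here's a rough sketch of what I'm picturing - purely illustrative, and the function names (generate_chain, verify_answer) are placeholders rather than anything OpenAI has published:

```python
def build_reasoning_dataset(problems, generate_chain, verify_answer, attempts=1000):
    """Collect (problem, chain, answer) triples whose final answer verifies."""
    dataset = []
    for problem in problems:
        for _ in range(attempts):
            chain, answer = generate_chain(problem)  # sample a reasoning chain + final answer
            if verify_answer(problem, answer):       # cheap check: run the tests, compare to the known result
                dataset.append((problem, chain, answer))
                break                                # first correct chain is enough; move to the next problem
    return dataset
```

The point is that nothing in that loop ever checks whether the chain itself makes sense to a human - only whether the final answer verifies.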

What happens with AI-generated reasoning steps is that the reasoning chain doesn't necessarily need to be correct - it just needs to cause the model to output the right answer. And as the LLM is trained over and over again on these generated chains, I wouldn't be surprised if we saw drift away from what a human would produce. Neural nets are great at exploiting small loopholes, and could exploit strange properties of their tokens to nudge the probability of a correct answer in completely unintelligible ways. I wouldn't be surprised if o3's or o4's reasoning chains looked completely crazy to us while referencing something in the model's internal model of the world in some clever way. Essentially, it could be developing its own reasoning language, known only to itself.

Of course, it could just be the summariser or the model making genuine mistakes and might not be this at all - but if the answers are correct, you could also be seeing the first signs of this kind of drift.

[–]ssmith12345uk[S] 3 points (0 children)

I think what you are describing is that we are getting a window into (or a conversion of) the underlying "unaligned" or "raw" reasoning of the model.

I guess I am surprised that the frequency of this behaviour is so high, and that whatever task analysis is taking place leads to so much variance in the reasoning steps. Compared with traditional ACoT and GCoT prompts it's wild - the outputs seem to be a mixture of spare tokens and outright hallucinations, and this sort of thing simply doesn't happen with other models, even at high temperatures.

Still interested to see if anyone else is reviewing these chains critically, or comparing o1's repeatability against a stable input. Since the chains aren't going to be visible via the API (presumably to avoid exactly this type of reverse engineering), I suspect it will remain a mystery!
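
For what it's worth, the repeatability check I have in mind is nothing fancy - roughly the sketch below (the chains themselves aren't returned by the API, so this only measures variance in the final answers; the model name is an assumption, use whichever o1 variant you have access to):

```python
from openai import OpenAI

client = OpenAI()
PROMPT = "A fixed, stable test prompt goes here."  # placeholder input

answers = []
for _ in range(10):
    resp = client.chat.completions.create(
        model="o1-preview",  # assumed model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers.append(resp.choices[0].message.content)

# Exact-string comparison is crude - eyeballing the answers is probably more useful -
# but it gives a first sense of how much the visible output varies for a stable input.
print(f"{len(set(answers))} distinct answers across {len(answers)} runs")
```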