You cannot RL-learn an O(N) algorithm in an O(1) feed-forward neural network.
You could RL-learn that when someone specifies a number, the appropriate thing to say is "Ok, 40 asterisks, let's count them: 1, *, 2, *, 3, *, ..." and then it would indeed produce 40 asterisks. But not as a single string. Producing them as one contiguous string requires some offline state/memory/processing, and all the neural network has access to is the last ~page of text.
Embedding the counting process into the text effectively stores the state of the O(N) algorithm in the O(N) text itself: the loop gets "unrolled" externally.
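A toy sketch of that idea (this is an illustration, not a real transformer; the `next_token` function and its `target` parameter are simplifications I'm inventing here). The "model" is a stateless function of the visible text: it can't keep a running counter internally, but because each step of the count is written into the transcript, every call only has to read the last count out of the context and emit the next token.

```python
# Illustrative sketch: a stateless "model" whose only memory is the text
# it can see. The O(N) loop state lives in the O(N) transcript, not in
# the model. (In the real setting the target would also be read from the
# prompt; passing it as an argument is a simplification.)

def next_token(context: str, target: int) -> str:
    """Decide the next token purely from the visible text."""
    count = context.count("*")          # read the 'counter' out of the text
    if count < target:
        return f"{count + 1}, *, "      # continue the externally unrolled loop
    return "<done>"

def generate(target: int) -> str:
    text = f"Ok, {target} asterisks, let's count them: "
    while True:
        tok = next_token(text, target)
        if tok == "<done>":
            return text
        text += tok

print(generate(5))
```

Each call to `next_token` does O(1) work with no hidden state carried between calls; the N steps happen only because the growing text re-feeds the intermediate state back in.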
It doesn’t have any logic; it just tries to complete strings in the most plausible way. Its training material probably did not have a lot of “write five at signs: @@@@@”. RLHF might help steer it in the right direction, but probably wouldn’t produce the concept of counting or loops.
So, this is where I guess I just don't understand. I've had ChatGPT produce code for me that there is absolutely no way was already in its training set. I realize it can't actually "think", but then I also don't know how to describe what I'm seeing.