This is hugely misleading. If your bot just memorizes Shakespeare and outputs memorized segments, of course nobody can tell the difference. But as soon as you start interacting with it, the difference couldn't be more pronounced.
>With these two evaluation sets, we conducted a blind pairwise comparison by asking approximately 100 evaluators on the Amazon Mechanical Turk platform to compare the quality of model outputs on these held-out sets of prompts. In the ratings interface, we present each rater with an input prompt and the output of two models. They are then asked to judge which output is better (or that they are equally good) using criteria related to response quality and correctness.
No, it's not just memorising Shakespeare; real humans interacted with the models and rated them. The aggregation behind those pairwise ratings is roughly what's sketched below.
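To be concrete about what that protocol produces, here's a minimal sketch of pairwise win-rate aggregation in Python. The record fields and the ties-count-as-half convention are my assumptions, not the paper's actual code:

    from collections import Counter

    # One record per rater judgment: which of the two model outputs
    # was preferred for a given prompt ("a", "b", or "tie").
    judgments = [
        {"prompt_id": 1, "preferred": "a"},
        {"prompt_id": 1, "preferred": "tie"},
        {"prompt_id": 2, "preferred": "b"},
    ]

    def win_rate(judgments, model="a"):
        # Fraction of judgments won by `model`, counting a tie as half a win.
        counts = Counter(j["preferred"] for j in judgments)
        return (counts[model] + 0.5 * counts["tie"]) / len(judgments)

    print(f"model A win rate: {win_rate(judgments, 'a'):.2f}")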
That's not what I meant by interaction. The evaluators would have to ask the models to perform tasks they came up with on their own. Otherwise there are just too many ways the information could have leaked.
OpenAI's models aren't immune to this either, so take any so-called evaluation metrics with a huge grain of salt. This also highlights the difficulty of properly evaluating LLMs: any metric, once established, can become a memorization target for LLMs and lose its meaning.
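About the only mitigation is checking whether the eval prompts themselves leaked into training data. A naive contamination check might look like the sketch below; it's purely illustrative (real checks would normalize text and use indexed lookups rather than pairwise scans):

    def ngrams(text, n=8):
        # Word-level n-grams, lowercased, for crude verbatim-overlap matching.
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def looks_contaminated(eval_prompt, training_docs, n=8):
        # Flag the prompt if any of its n-grams appears verbatim in training data.
        grams = ngrams(eval_prompt, n)
        return any(grams & ngrams(doc, n) for doc in training_docs)

    # Hypothetical usage: scan held-out prompts against the training corpus.
    training_docs = ["...training corpus documents..."]
    prompts = ["Summarize the plot of Hamlet in two sentences."]
    flagged = [p for p in prompts if looks_contaminated(p, training_docs)]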