Machine learning researchers had been experimenting with large language models (LLMs) for a few years by the time ChatGPT was released in late 2022, but the general public had not been paying close attention and didn't realize how powerful these models had become.
If you know anything about this subject, you’ve probably heard that LLMs are trained to “predict the next word” and that they require huge amounts of text to do this.
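As a rough illustration of what "predict the next word" means, here is a toy bigram model in Python. It is a hypothetical sketch, not how LLMs work internally: real LLMs use large neural networks rather than word counts, and the names `train_bigram` and `predict_next` are made up for this example. It only shows the basic idea that the next word is predicted from patterns in training text rather than from hand-written rules.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


def train_bigram(text: str) -> Dict[str, Counter]:
    """Toy 'training': count which words follow each word in the text."""
    words = text.lower().split()
    follows: Dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(model: Dict[str, Counter], word: str) -> Optional[str]:
    """Return the word seen most often after `word`, or None if `word` was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints: cat ("cat" follows "the" twice)
print(predict_next(model, "dog"))  # prints: None (never seen in the corpus)
```

Even this toy model can say nothing about words it never saw during training, which gives a small intuition for why much more capable models need so much text.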
Conventional software is created by human programmers, who give computers explicit, step-by-step instructions.
By contrast, ChatGPT is built on a neural network that was trained using billions of words of ordinary language.
Finally, we’ll explain how these models are trained and explore why good performance requires such phenomenally large quantities of data.
GPT-4 was able to draw a unicorn even though the training data for the version tested by the authors was entirely text-based. That is, there were no images in its training set. Yet GPT-4 apparently learned to reason about the shape of a unicorn's body after training on a huge amount of written text.
It's as if these models can, in some sense, "see."