Wait a second here… I skimmed the paper and the GitHub repo and didn't find an answer to a very important question: is this GPT-3.5 or GPT-4? There's a huge difference in code quality between the two, so either they made a giant accidental omission or they're being intentionally misleading. Please correct me if I missed where they specified it. I'm assuming they were using GPT-3.5, in which case these results are about what you'd expect: on the HumanEval benchmark, GPT-4 gets 67%, and that goes up to roughly 90% with Reflexion prompting, while GPT-3.5 gets 48.1%, which is exactly what this paper reports (source).
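For anyone wondering what those percentages actually measure: HumanEval results are usually reported as pass@k, the probability that at least one of k sampled completions passes the task's unit tests (the headline numbers above are pass@1). A quick sketch of the standard unbiased estimator from the original HumanEval paper, in Python:

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate given n samples per task, of which c passed."""
        # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
        if n - c < k:
            return 1.0  # every size-k draw must contain a passing sample
        return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

    # e.g. 200 samples per task, 96 passing, k = 1  ->  0.48
    print(pass_at_k(200, 96, 1))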
I’m talking about the models and how they’re written about in the literature. I don’t care how OpenAI brands their products.
From the paper itself:
For the additional 2000 SO questions, ChatGPT 3.5 Turbo API is used.
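Which settles it for the SO portion at least, because the API requires an explicit model identifier on every request, so there's no ambiguity about which model generated those answers. Roughly what such a call looks like with the official openai Python client (the prompt here is just a placeholder):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # The `model` string pins down exactly which model answers:
    # "gpt-3.5-turbo" here, as opposed to "gpt-4".
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write a function that reverses a string."}],
    )
    print(resp.choices[0].message.content)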
Whatever GitHub Copilot uses (the version with the chat feature), I don’t find its code answers to be particularly accurate. Do we know which version that product uses?