You should see 52% of the first version of my code.
It doesn’t have to be right to be useful.
Yeah, but the non-tech savvy business leaders see they can generate code with AI and think ‘why do I need a developer if I have this AI?’ and have no idea whether the code it produces is right or not. This stat should be shared broadly so leaders don’t overestimate the capability and fire people they will desperately need.
Programming jobs will be safe for a while. They’ve been trying to eliminate those positions since at least the 90s. Because coders are expensive and often lack social skills.
But I do think the clock is ticking. We will see more and more sophisticated AI tools that are relatively idiot-proof and can do things like modify Salesforce, or create complex new Tableau reports with a few mouse clicks, and stuff like that. Jobs will be chiseled away like our unfortunate friends in graphic design.
You, along with most people, are still looking at automation wrong. It’s never been about removing people entirely, even with AI; it’s about doing the same work at lower cost.
If you can eliminate one programmer from your four-person team by giving the other three AI to produce the same amount of work, congrats, you’ve just automated one programming job.
Programming jobs aren’t going anywhere, but either the amount of code produced is about to skyrocket, or the number of employed programmers is going to drop (or most likely both of those things).
I say let it happen. If someone is dumb enough to fire all their workers… They deserve what will happen next
Mentioned it before but:
LLMs program at the level of a junior engineer or an intern. You already need code review and more senior engineers to fix that shit for them.
What they do is shift that problem down a level. Now that junior engineer has an intern they are trying to work with. Or… companies realize they don’t benefit from training up those newbie (or stupid) engineers when they are likely to leave in a year or two anyway.
Yeah cause my favorite thing to do when programming is debugging someone else’s broken code.
I think where it shines is in helping you write code you’ve never written before. I never touched Swift before and I made a fully functional iOS app in a week. Also, even with stuff I have done before, I can say “write me a function that does x” and it will and it usually works.
Like just yesterday I asked it to write me a function that would generate and serve up an .ics file based on a selected date and extrapolate the date of a recurring monthly meeting based on the day of the week picked and its position (1st week, 2nd week, etc) within the month and then make the .ics file reflect all that. I could have generated that code myself by hand but it would have probably taken me an hour or two. It did it in about five seconds and it worked perfectly.
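Roughly the kind of thing it handed me, reconstructed from memory as a Python sketch (the names and the standard-library-only approach here are my own illustration, not the model’s verbatim output):

```python
import calendar
from datetime import date

def nth_weekday_of_month(year: int, month: int, weekday: int, position: int) -> date:
    """Return e.g. the 2nd Tuesday of a month (weekday: Mon=0..Sun=6, position: 1-based)."""
    first_weekday, days_in_month = calendar.monthrange(year, month)
    offset = (weekday - first_weekday) % 7          # days from the 1st to the first such weekday
    day = 1 + offset + (position - 1) * 7
    if day > days_in_month:
        raise ValueError("That month has no such occurrence")
    return date(year, month, day)

def build_ics(start: date, weekday: int, position: int, months: int = 12) -> str:
    """Emit a minimal .ics with one VEVENT per occurrence of the recurring monthly meeting."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//meetings//EN"]
    year, month = start.year, start.month
    for _ in range(months):
        d = nth_weekday_of_month(year, month, weekday, position)
        lines += [
            "BEGIN:VEVENT",
            f"DTSTART;VALUE=DATE:{d.strftime('%Y%m%d')}",
            "SUMMARY:Monthly meeting",
            "END:VEVENT",
        ]
        month += 1
        if month > 12:
            month, year = 1, year + 1
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)
```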
Yeah, you have to know what you’re doing in general and there’s a lot of babysitting involved, but anyone who thinks it’s just useless is plain wrong. It’s fucking amazing.
Edit: lol the article is referring to a study that was using GPT-3.5, which is all but useless for coding. GPT-4 has been out for a year blowing everybody’s minds. Clickbait trash.
Bad take. Is the first version of your code the one that you deliver or push upstream?
LLMs can give great starting points, I use multiple LLMs each for various reasons. Usually to clean up something I wrote (too lazy or too busy/stressed to do manually), find a problem with the logic, or maybe even brainstorm ideas.
I rarely ever use it to generate blocks of code like asking it to generate “a method that takes X inputs and does Y operations, and returns Z value”. I find that those kinds of results are often vastly wrong or just done in a way that doesn’t fit with other things I’m doing.
Impressed some folks think LLMs are useless. Not sure if their lives/workflows/brains are that different from ours or they just haven’t given it the old college try.
I almost always have to use my head before a language model’s output is useful for a given purpose. The tool almost always saves me time, improves the end result, or both. Usually both, I would say.
It’s a very dangerous technology that is known to output utter garbage and make enormous mistakes. Still, it routinely blows my mind.
This is the best summary I could come up with:
In recent years, computer programmers have flocked to chatbots like OpenAI’s ChatGPT to help them code, dealing a blow to places like Stack Overflow, which had to lay off nearly 30 percent of its staff last year.
That’s a staggeringly large proportion for a program that people are relying on to be accurate and precise, underlining what other end users like writers and teachers are experiencing: AI platforms like ChatGPT often hallucinate totally incorrect answers out of thin air.
For the study, the researchers looked at 517 questions on Stack Overflow and analyzed ChatGPT’s attempts to answer them.
The team also performed a linguistic analysis of 2,000 randomly selected ChatGPT answers and found they were “more formal and analytical” while portraying “less negative sentiment” — the sort of bland and cheery tone AI tends to produce.
The Purdue researchers polled 12 programmers — admittedly a small sample size — and found they preferred ChatGPT at a rate of 35 percent and didn’t catch AI-generated mistakes at 39 percent.
The study demonstrates that ChatGPT still has major flaws — but that’s cold comfort to people laid off from Stack Overflow or programmers who have to fix AI-generated mistakes in code.
The original article contains 340 words, the summary contains 199 words. Saved 41%. I’m a bot and I’m open source!
It’s been a tremendous help to me as I relearn how to code on some personal projects. I have written 5 little apps that are very useful to me for my hobbies.
It’s also been helpful at work with some random database type stuff.
But it definitely gets stuff wrong. A lot of stuff.
The funny thing is, if you point out its mistakes, it often does better on subsequent attempts. It’s more like an iterative process of refinement than one prompt gives you the final answer.
It’s incredibly useful for learning. ChatGPT was what taught me to unlearn, essentially, writing C in every language, and how to write idiomatic Python and JavaScript.
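A toy illustration of the kind of habit I mean (my own example, not anything the model produced), squaring the even numbers in a list:

```python
# The "C in every language" version I used to write:
def squares_of_evens_c_style(nums):
    result = []
    i = 0
    while i < len(nums):
        if nums[i] % 2 == 0:
            result.append(nums[i] * nums[i])
        i += 1
    return result

# The idiomatic Python it nudged me toward:
def squares_of_evens(nums):
    return [n * n for n in nums if n % 2 == 0]
```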
It is very good for boilerplate code or fleshing out a big module without you having to do the typing. My experience was just like yours; once you’re past a certain (not real high) level of complexity you’re looking at multiple rounds of improvement or else just doing it yourself.
It is very good for boilerplate code
Personally I find all LLMs in general not that great at writing larger blocks of code. It’s fine for smaller stuff, but the more you expect out of it the more it’ll get wrong.
I find they work best with existing stuff that you provide. Like “make this block of code more efficient” or “rewrite this function to do X”.
The funny thing is, if you point out its mistakes, it often does better on subsequent attempts.
Or it gets stuck in an endless loop, alternating between two different but wrong solutions.
Me: This is my system, version x. I want to achieve this.
ChatGPT: Here’s the solution.
Me: But this only works with version y of the given system, not x.
ChatGPT: <Apology> Try this.
Me: This is using a method that never existed in the framework.
ChatGPT: <Apology> <Gives first solution again>
I used to have this issue more often as well. I’ve had good results recently by **not** pointing out mistakes in replies, but by going back to the message before GPT’s response and saying “do not include y.”
Agreed, I send my first prompt, review the output, smack my head “obviously it couldn’t read my mind on that missing requirement”, and go back and edit the first prompt as if I really was a competent and clear communicator all along.
It’s actually not a bad strategy, because it can make some adept assumptions about details that would otherwise have seemed pertinent to spell out, so instead of typing out every requirement you can think of, you speech-to-text* a half-assed prompt and then know exactly what to fix a few seconds later.
*[ad] free Ecco Dictate on iOS, TypingMind’s built-in dictation… anything using OpenAI Whisper, godly accuracy. btw TypingMind is great - stick in GPT-4o & Claude 3 Opus API keys and boom
This is because all LLMs function primarily based on the token context you feed them.
The best way to use any LLM is to completely fill up its history with relevant context, then ask your question.
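A minimal sketch of what I mean, using the OpenAI Python SDK (assumes an API key in the environment; the project details in the context string are made up for illustration):

```python
from openai import OpenAI

client = OpenAI()

# Front-load the context: paste in the relevant code, versions, and constraints
# *before* asking the actual question.
context = """
Project: Flutter 3.x Android app (hypothetical example)
Relevant file: lib/meeting_picker.dart
<paste the actual code and error output here>
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a senior Flutter developer reviewing this project."},
        {"role": "user", "content": context + "\n\nQuestion: why does the date picker reset on rebuild?"},
    ],
)
print(response.choices[0].message.content)
```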
I was recently asked to make a small Android app using Flutter, which I had never touched before.
I used ChatGPT at first and it was so painful to get correct answers, but then I made an agent (or whatever it’s called) where I gave it instructions saying it was a Flutter dev, plus a bunch of specifics about what I was working on.
Suddenly it became really useful… I could throw it chunks of code and it would just straight away tell me where the error was and what I needed to change.
I could ask it to write me an example method for something that I could then easily adapt for my use.
One thing I would do was ask it to write a method to do X while I was writing the part that would use that method.
This wasn’t a big project and the whole thing took less than 40 hours, but for me to pick up a new language, set up the development environment, and make a working app for a specific task in 40 hours was a huge deal to me… I think without ChatGPT, just learning all the basics and debugging would have taken more than 40 hours alone.
AI Defenders! Assemble!
What’s especially troubling is that many human programmers seem to prefer the ChatGPT answers. The Purdue researchers polled 12 programmers — admittedly a small sample size — and found they preferred ChatGPT at a rate of 35 percent and didn’t catch AI-generated mistakes at 39 percent.
Why is this happening? It might just be that ChatGPT is more polite than people online.
It’s probably more because you can ask it your exact question (not just search for something more or less similar) and it will at least give you a lead that you can use to discover the answer, even if it doesn’t give you a perfect answer.
Also, who does a survey of 12 people and publishes the results? Is that normal?