So I signed up for a free month of their crap because I wanted to test if it solves novel variants of the river crossing puzzle.

Like this one:

You have a duck, a carrot, and a potato. You want to transport them across the river using a boat that can take yourself and up to 2 other items. If the duck is left unsupervised, it will run away.

Unsurprisingly, it does not:

https://g.co/gemini/share/a79dc80c5c6c

https://g.co/gemini/share/59b024d0908b

The only 2 new things seem to be that old variants are no longer novel, and that it is no longer limited to producing incorrect solutions - now it can also incorrectly claim that the solution is impossible.
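For the record, the variant above is trivially solvable by brute force. Here's a rough sketch of a breadth-first search over boat trips, assuming the rules exactly as I stated them (the boat holds you plus up to two items, and the duck bolts if it's ever on a bank without you):

```python
from collections import deque
from itertools import combinations

ITEMS = frozenset({"duck", "carrot", "potato"})

def solve():
    # State: (which bank you are on, items still on the left bank). Everything starts on the left.
    start = ("L", ITEMS)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (you, left), path = queue.popleft()
        if you == "R" and not left:
            return path                                  # everything delivered
        here = left if you == "L" else ITEMS - left
        for k in range(3):                               # boat: you plus up to 2 items
            for cargo in map(set, combinations(sorted(here), k)):
                if "duck" in here - cargo:               # duck left unsupervised -> runs away
                    continue
                new_left = left - cargo if you == "L" else left | cargo
                state = ("R" if you == "L" else "L", new_left)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [f"cross carrying {sorted(cargo) or 'nothing'}"]))
    return "impossible"

print(solve())
```

It finds a three-crossing plan: take the duck and one vegetable over, bring the duck back, then take the duck and the other vegetable over. So "impossible" is just wrong.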

I think chain of thought / reasoning is a fundamentally dishonest technology. At the end of the day, just like older LLMs, it requires that someone has already solved a similar problem (either online or perhaps in a problem/solution pair they generated, if they do that to augment the training data).

But it outputs quasi-reasoning to pretend that it is actually solving the problem live.

-12 points

It’s just overtrained on the puzzle such that it mostly ignores your prompt. Swapping out a few words doesn’t change the fact that it recognises the puzzle. Try writing it out in ASCII, or uploading an image with it written out, or some other weird way that it hasn’t been specifically trained on, and I bet it actually performs better.

18 points

oh look it’s a loadbearing “just” in the wild. better hope you can shore that fucker up with some facts

Try writing it out in ASCII

my poster in christ, what in the fuck are you on about. stop prompting LLMs and go learn some things instead

some other weird way that it hasn’t been specifically trained on and I bet it actually performs better

“no no see, you just need to prompt it different. just prompt it different bro it’ll work bro I swear bro”

god, every fucking time

10 points

All along my mistake was that I was prompting it in unicode instead of latin1, alphameric BCD, or “modified UTF-8”.

7 points

I thought everyone knew that you had to structure prompts in ALGOL 420 to get the best performance by going close to the metal

8 points

Well has anyone tried prompting it in EBCDIC? How do we know doing so won’t immediately create the super intelligence that "or whatever"s us to silicon Valhalla? Asking for a friend.

5 points

you know, I was briefly considering trying it, and I figured you’d probably have to force it with content-escaping tricks or something (at least I presume their APIs will do basic type-checking…)

got other yaks to do atm tho

18 points

I bet it generates stochastic nonsense you’ll read like tea leaves

-8 points

Bet

12 points

The accumulated filth of all their slop and murder will foam up about their waists and all the whores and prompt enjoyers will look up and shout: ‘Bet’ - and I’ll whisper ‘no.’

9 points

Butt

( Y )

14 points

write it out in ASCII

My dude, what do you think ASCII is? Assuming we’re using standard internet interfaces here and the request is coming in as UTF-8 encoded English text, it is already being written out in ASCII.

Sneers aside, given that the supposed capability here is examining a text prompt and reasoning through the relevant information to provide a solution in the form of a text response, this kind of test is, if anything, rigged in favor of the AI compared to similar versions that add more steps to the task, like OCR or other forms of image parsing.

It also speaks to a difference in how AI pattern recognition works compared to the human version. For a sufficiently well-known pattern like the form of this river-crossing puzzle, it’s the changes and exceptions that jump out to a human. This feels almost like giving someone a picture of the Mona Lisa with aviators on; the model recognizes that it’s 99% the Mona Lisa and goes from there, rather than recognizing that the changes from that base case are significant and intentional variation, not a totally new thing or a ‘corrupted’ version of the original.

-8 points

Exactly. It’s overtrained on the test, so it ignores the differences. If you instead used a phrasing it can read but doesn’t recognise as the test pattern (because it doesn’t map onto the same tokens/embeddings), it will perform better. I’m not joking, it’s a common tactic for getting around censoring. You’re just working around the issue. What I’m saying is they’ve trained the model so much on benchmarks that it is indeed dumber.

16 points

I don’t think that the actual performance here is as important as the fact that it’s clearly not meaningfully “reasoning” at all. This isn’t a failure mode that happens if it’s actually thinking through the problem in front of it and understanding the request. It’s a failure mode that comes from pattern matching without actual reasoning.

13 points

The machine I love can’t be dumb, I love the machine and I can’t love what is dumb.

11 points

“it can’t be that stupid, you must be prompting it wrong”

7 points

Not really. Here’s the chain-of-word-vomit that led to the answers:

https://pastebin.com/HQUExXkX

Note that in its “it’s impossible” answer it correctly echoes that you can take one other item with you, and does not bring the duck back (while the old overfitted GPT-4 obsessively brought items back), while in the duck + 3 vegetables variant it has a correct answer in the word vomit, but, not being an AI enthusiast, it can’t actually choose that correct answer (a problem shared with the monkeys on typewriters).

I’d say it clearly isn’t ignoring the prompt or the differences from the original river crossings. It just can’t actually reason, and the problem requires a modicum of reasoning, much as unloading groceries from a car does.

