This weakness in ChatGPT and its siblings isn’t surprising when you consider how these large language models work. They don’t actually “see” what we type as a string of characters. Instead, there’s an intermediate step where the text gets turned into “tokens”: numbers that index entries in a giant vocabulary the LLM was trained with. So if I type “How do you spell apple?”, it gets turned into [2437, 466, 345, 4822, 17180, 30]. The “17180” in there is the token ID for “apple.” Note that that token is for lowercase “apple”; the token for “Apple” is 16108. The LLM “knows” that 17180 and 16108 usually mean the same thing, and that 16108 tends to show up at the beginning of a sentence or when referring to the computer company, but it doesn’t inherently know how either one is actually spelled or capitalized unless there was information about that in its training data.
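Here’s a toy sketch of that encoding step. The token IDs are the ones quoted above (from OpenAI’s tokenizer); everything else is a simplified stand-in — real tokenizers use byte-pair encoding over subword pieces, not a whole-word lookup table like this:

```python
# Tiny stand-in vocabulary. The IDs match the example in the comment;
# a real LLM vocabulary has tens of thousands of subword entries.
vocab = {
    "How": 2437, " do": 466, " you": 345, " spell": 4822,
    " apple": 17180, " Apple": 16108, "?": 30,
}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text: str) -> list[int]:
    # Greedy longest-match against the vocabulary. (Real byte-pair
    # encoding instead merges byte pairs, so it never hits an unknown.)
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocab:
                ids.append(vocab[text[pos:end]])
                pos = end
                break
        else:
            raise ValueError(f"no token for text at position {pos}")
    return ids

def decode(ids: list[int]) -> str:
    return "".join(id_to_token[i] for i in ids)

print(encode("How do you spell apple?"))
# [2437, 466, 345, 4822, 17180, 30]
```

The model only ever sees the ID sequence on the last line; the letters “a-p-p-l-e” are gone by the time the network runs, which is why spelling questions trip it up.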

You can play around with OpenAI’s tokenizer here to see more of this sort of thing. Click the “show example” button for some illustrative text, and switch the view at the bottom between “token ids” and “text” to see how the text gets chopped up.
