Hi,

Just like the title says:

I’m trying to run:

With:

  • koboldcpp:v1.43 using HIPBLAS on a 7900XTX / Arch Linux

Running :

--stream --unbantokens --threads 8 --usecublas normal

I get very limited output with lots of repetition.

Illustration

I mostly didn’t touch the default settings:

Settings

Does anyone know how I can make things run better?

EDIT: Sorry for multiple posts, Fediverse bugged out.

2 points

Thanks a lot for your input. It’s a lot to stomach but very descriptive which is what I need.

I run this Koboldcpp in a container.

What I ended up doing, and what semi-worked, is:

  • --model "/app/models/mythomax-l2-13b.ggmlv3.q5_0.bin" --port 80 --stream --unbantokens --threads 8 --contextsize 4096 --useclblas 0 0

In the KoboldCpp UI, I set the max response tokens to 512, switched to an Instruction/response model, and kept prompting with “continue the writing”, using the MythoMax model.

But I’ll be re-checking your way of doing it because the SuperCOT model seemed less streamlined and more qualitative in its story writing.

1 point
*

Alright. I have another remark: Add --ropeconfig 1.0 10000 after your --contextsize 4096.

You’re not using a ‘gguf’ file but the ggml format that was used up until a few days ago. The older ggml format doesn’t save any metadata with the model. That means KoboldCpp doesn’t know what the original context size was, and it doesn’t know whether it needs to scale to reach your specified 4096 tokens. I’ve read enough on the GitHub page to know this fails. It’ll assume a default context size of 2048 (which was the correct value for LLaMA1) and use scaling to get to the 4096 tokens. But that’s wrong here, because this is a Llama2 model, and that already has a native context size of 4096 tokens.

‘--ropeconfig 1.0 10000’ means ‘don’t scale’ and that’s the right thing in this case.
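If it helps to see what those two numbers do, here is a rough Python sketch of the standard RoPE angle formula. It is not KoboldCpp’s actual code. The function names are made up for illustration, and the head dimension of 128 is an assumption. The first number is a linear scale applied to the token position, the second is the frequency base.

```python
import math


def rope_angles(position, scale=1.0, base=10000.0, head_dim=128):
    """Rotation angle for each pair of dimensions at one token position.

    Standard RoPE formula; head_dim=128 is an assumed value.
    """
    return [
        position * scale * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]


def longest_wavelength(scale=1.0, base=10000.0, head_dim=128):
    """Positions needed for the slowest-rotating pair to make one full turn."""
    slowest = scale * base ** (-2.0 * (head_dim // 2 - 1) / head_dim)
    return 2.0 * math.pi / slowest


# Compare the plain setting with a raised base.
plain = longest_wavelength(1.0, 10000.0)
stretched = longest_wavelength(1.0, 32000.0)
print(f"base 10000: {plain:.0f} positions, base 32000: {stretched:.0f} positions")
```

A scale of 1.0 with base 10000 leaves the angles as the model saw them in training. Raising the base or lowering the scale makes the rotations slower, which stretches the positions over a longer context.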

You can verify this by looking at the debug information KoboldCpp logs on startup. When I start it with your mentioned arguments, it says:

Using automatic RoPE scaling (scale:1.000, base:32000.0)

And that is about a 2x factor, which is wrong. I’m sorry this is so complicated, you basically need to know all that stuff. But that’s the reason why we’re changing the file format to ‘gguf’ which will make that easier in the future. Just add the ‘ropeconfig’ and you’ll be fine for now.
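If you want to check the startup log automatically, something like this Python sketch could work. The pattern is only based on the single log line quoted above, so the exact wording in other KoboldCpp versions (or when ‘ropeconfig’ is set by hand) is an assumption. The function names are hypothetical.

```python
import re

# Matches e.g. "... RoPE scaling (scale:1.000, base:32000.0)".
# Format assumed from the one log line quoted in this thread.
ROPE_LOG = re.compile(r"RoPE scaling \(scale:([0-9.]+), base:([0-9.]+)\)")


def parse_rope_log(line):
    """Return (scale, base) as floats, or None if the line does not match."""
    match = ROPE_LOG.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def is_unscaled(line):
    """True only if the line reports scale 1.0 and base 10000 (no scaling)."""
    return parse_rope_log(line) == (1.0, 10000.0)


line = "Using automatic RoPE scaling (scale:1.000, base:32000.0)"
print(parse_rope_log(line), is_unscaled(line))
```

For a Llama2 model at 4096 context you want `is_unscaled` to come back true. The line from my log above does not pass that check.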

If you want, you can learn more about ‘scaling’ here: https://github.com/LostRuins/koboldcpp/wiki#koboldcpp-general-usage-and-troubleshooting

Other than that: Have fun. I hope you get the technical details out of the way quickly so you can focus on the fun stuff. If you have any questions, feel free to ask. I always like to read what other people are up to. And their results (or interesting fails ;)

(I think 4096 is plenty for the first tries with this. If your stories get longer and you want 8192 context with a Llama2-based model like MythoMax-L2 use this: --contextsize 8192 --ropeconfig 1.0 32000 and don’t forget to also adjust the slider in the Kobold Lite Web UI.)
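To keep the two combinations straight, here is a small Python sketch that builds the relevant arguments. It only covers the two Llama2 cases mentioned in this thread. The helper name and the lookup table are mine, not part of KoboldCpp, and other context sizes or model families would need their own values.

```python
import shlex

# (scale, base) pairs for Llama2-based models, taken from this thread only.
LLAMA2_ROPE = {
    4096: ("1.0", "10000"),  # native context size, no scaling
    8192: ("1.0", "32000"),  # doubled context via a larger base
}


def kobold_args(model_path, context_size):
    """Build the context-related KoboldCpp arguments for a Llama2-based model."""
    if context_size not in LLAMA2_ROPE:
        raise ValueError(f"no ropeconfig known for context size {context_size}")
    scale, base = LLAMA2_ROPE[context_size]
    return [
        "--model", model_path,
        "--contextsize", str(context_size),
        "--ropeconfig", scale, base,
    ]


args = kobold_args("/app/models/mythomax-l2-13b.ggmlv3.q5_0.bin", 4096)
print(shlex.join(args))
```

Append the rest of your flags (port, threads and so on) to that list as before.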

2 points

Don’t be sorry, you’re being so helpful, thank you a lot.

I finally replicated your config:

localhost/koboldcpp:v1.43 --port 80 --threads 4 --contextsize 8192 --useclblas 0 0 --smartcontext --ropeconfig 1.0 32000 --stream "/app/models/mythomax-l2-kimiko-v2-13b.Q5_K_M.gguf"

And had satisfying results! The performance of LLaMA2 really is nice to have here as well.

1 point
*

Looks good to me.

For reference: I think I got the settings in my screenshot from Reddit. But they seem to have updated the post since. The current recommended settings have a temperature and some other settings that are closer to what I’ve seen in the default settings. I’ve tested those (new to me) settings and they also work for me. Maybe I also adapted the settings from here.

And I’ve linked a 33b MythoMax model in the previous post that’s probably not working properly. I’ve edited that part and crossed it out. But you seem to use a 13b version anyways. That’s good.

I’ve tried a few models today. I think another promising model for writing stories is Athena. For your information: I get inspiration from this list. But beware, that’s for ERP, so erotic role play. So some models from that ranking are probably not safe for work (or for minors). But other benchmarks often test for factual knowledge and answering questions. And in my experience the models good at those things are not necessarily good at creative tasks. But that’s more my belief. I don’t know if it’s actually true. And this ranking also isn’t very scientific.

