The Rule(lemmy.ml)

posted 2 months ago

roon@lemmy.ml

196@lemmy.blahaj.zone

63 comments

Pumpkin Escobar@lemmy.world

8 points

2 months ago

There’s quantization, which basically compresses the model by storing each weight in a smaller data type. That reduces memory requirements by half or even more.
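To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration only, not how any particular library does it: the function names `quantize_int8` and `dequantize_int8` and the toy weight list are made up for this example, and real tools typically quantize per block or per channel rather than per tensor.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor so we never divide by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


# Hypothetical toy weights, standing in for one tensor of a model.
weights = [0.12, -0.5, 0.33, 0.9, -0.04]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Each restored weight is off by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of two (fp16) or four (fp32), at the cost of a small rounding error per weight.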

There’s also airllm, which loads part of the model into RAM, runs those calculations, unloads that part, loads the next part, and so on. It’s a nice option, but the performance of all that loading and unloading is never going to be great, especially on a huge model like Llama 405B.
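The load/compute/unload loop can be sketched roughly like this. To be clear, this is not airllm's API: the "layers" here are toy scale-and-bias records saved as JSON files, and `save_layers` and `run_layer_by_layer` are hypothetical names used only to show the pattern of keeping one layer in memory at a time.

```python
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each toy layer to its own file, standing in for model shards on disk."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(layer, f)
        paths.append(path)
    return paths


def run_layer_by_layer(paths, x):
    """Apply the layers in order, holding only one of them in memory at a time."""
    for path in paths:
        with open(path) as f:
            layer = json.load(f)  # load just this layer
        x = [layer["scale"] * v + layer["bias"] for v in x]  # run its calculation
        del layer  # drop it before loading the next one
    return x


with tempfile.TemporaryDirectory() as tmp:
    toy_layers = [{"scale": 2.0, "bias": 1.0}, {"scale": 0.5, "bias": 0.0}]
    paths = save_layers(toy_layers, tmp)
    result = run_layer_by_layer(paths, [1.0, 2.0])

print(result)  # [1.5, 2.5]
```

Peak memory is bounded by the largest single layer, but every forward pass pays the disk read for every layer, which is where the slowdown comes from.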

Then there are some neat projects to distribute models across multiple computers like exo and petals. They’re more targeted at a p2p-style random collection of computers. I’ve run petals in a small cluster and it works reasonably well.

AdrianTheFrog@lemmy.world

1 point

2 months ago

Yes, but 200 GB is probably already with 4-bit quantization; the weights in fp16 would be more like 800 GB. IDK if it’s even possible to quantize further. If it is, you’re probably better off going with a smaller model anyway.
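The back-of-the-envelope numbers are easy to check. This sketch assumes 405 billion parameters and counts only the weights (no KV cache, activations, or quantization metadata such as scales), with 1 GB taken as 10^9 bytes; `weight_memory_gb` is just a helper name for this example.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 405e9  # assumed parameter count for a 405B model

sizes = {bits: weight_memory_gb(PARAMS, bits) for bits in (16, 8, 4, 2)}
for bits, gb in sizes.items():
    print(f"{bits:>2}-bit: {gb:.1f} GB")
```

That gives about 810 GB at 16 bits and about 202.5 GB at 4 bits, which lines up with the 800 GB and 200 GB figures above.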
