
Will llama.cpp multislot improve speed?


I've heard mostly bad opinions about multiple slots in llama.cpp (--parallel > 1). Compared to vLLM it's probably worse at this, but I recently tried vLLM with 4 slots and it did improve overall throughput significantly (from 150-170 tps decode on single-slot llama.cpp to ~400 tps with 4-slot vLLM, when all 4 slots are busy). BUT vLLM handles CPU offload poorly (or I don't know how to use it properly) and, from what I've heard, doesn't work well with GGUFs, which limits the availab…
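For anyone wanting to test this themselves, here is a minimal sketch of enabling multiple slots in llama-server (the model path and sizes below are placeholders, and flag behavior may vary by build). One thing to keep in mind: `-c` sets the total KV cache size, which is divided among the slots, so per-request context shrinks as you add slots.

```shell
# Serve a GGUF model with 4 parallel slots (continuous batching is
# enabled by default in recent builds). -c is the TOTAL context /
# KV cache budget and is split across slots, so each of the 4 slots
# gets roughly 4096 tokens here. -ngl 99 offloads all layers to the
# GPU if they fit; lower it to keep some layers on CPU.
llama-server -m ./model.gguf -ngl 99 -c 16384 --parallel 4
```

Whether this helps depends on keeping all slots busy with concurrent requests; with a single client sending one request at a time, extra slots only cost context.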

Source: Reddit
