llama.cpp DeepSeek v4 Flash experimental inference

April 26, 2026 at 10:20 AM

Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here is the GGUF you can use to run inference with "just" (lol) 128 GB of RAM. The model, even quantized to 2 bits, looks very solid in my limited testing, and 17 t/s on my MacBook M3 Max is quite interesting; I would say we are in the usable zone. What I did was heavily quantize the routed experts to 2 bits, using two different 2-bit quants to balance error and size. All the rest of the model, inc
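
To get a feel for why squeezing only the routed experts matters so much for the RAM budget, here is a rough back-of-the-envelope sketch. It is not taken from the actual conversion: the group names, parameter counts, and bits-per-weight values are all hypothetical placeholders, and `TensorGroup`, `total_gib`, and `average_bpw` are names made up for illustration. Plug in the real tensor sizes and quant types if you know them.

```python
from dataclasses import dataclass


@dataclass
class TensorGroup:
    name: str
    params: float           # number of weights in this group
    bits_per_weight: float  # effective bits per weight, including scale overhead


def total_gib(groups):
    """Total size of all groups in GiB."""
    bits = sum(g.params * g.bits_per_weight for g in groups)
    return bits / 8 / 2**30


def average_bpw(groups):
    """Parameter-weighted average bits per weight across all groups."""
    total_params = sum(g.params for g in groups)
    total_bits = sum(g.params * g.bits_per_weight for g in groups)
    return total_bits / total_params


# HYPOTHETICAL numbers, only to show the shape of the calculation.
groups = [
    TensorGroup("routed experts, 2-bit quant A", 120e9, 2.1),
    TensorGroup("routed experts, 2-bit quant B", 120e9, 2.6),
    TensorGroup("everything else, higher precision", 15e9, 8.0),
]

print(f"estimated size: {total_gib(groups):.1f} GiB")
print(f"average bits per weight: {average_bpw(groups):.2f}")
```

Since the routed experts hold most of the weights in a mixture-of-experts model, their bits per weight dominate the total, which is why the rest of the model can stay at a higher precision without blowing the budget.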
