llama.cpp DeepSeek v4 Flash experimental inference
Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here is the GGUF you can use to run the inference with "just" (lol) 128GB of RAM. The model, even quantized to 2 bits, looks very solid in my limited testing, and the speed of 17 t/s on my MacBook M3 Max is quite interesting; I would say we are in the usable zone. What I did was heavily quantize the routed experts to 2 bits, using two different 2-bit quants to balance error and size. All the rest of the model, inc…
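To give a feel for why this mix lands in the 128GB ballpark, here is a rough back-of-envelope sketch. The parameter counts and bits-per-weight figures below are illustrative placeholders I picked for the example, not the actual DeepSeek v4 numbers or the exact quant formats used in the GGUF.

```python
# Back-of-envelope size estimate for a MoE model whose routed experts are
# quantized to ~2 bits while the rest stays at higher precision.
# All parameter counts and bpw values are placeholders, not DeepSeek v4 figures.

def group_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of one tensor group at a given quantization."""
    return params_billion * bits_per_weight / 8  # 1e9 params * bits / 8 / 1e9 bytes

# Hypothetical split: the bulk of the weights sit in the routed experts, which
# get a blend of two ~2-bit formats; attention/shared/embedding tensors stay larger.
experts_gb = group_size_gb(params_billion=380, bits_per_weight=2.2)  # placeholder
dense_gb   = group_size_gb(params_billion=20,  bits_per_weight=6.0)  # placeholder

print(f"experts ~{experts_gb:.0f} GB + dense ~{dense_gb:.0f} GB "
      f"= ~{experts_gb + dense_gb:.0f} GB, leaving headroom for the KV cache in 128 GB")
```

The point is simply that since almost all of the parameters live in the routed experts, pushing only those down to ~2 bits per weight dominates the total size, while keeping the dense/attention tensors at higher precision costs comparatively little.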
Original article