Speculative decoding with Gemma-4-31B + Gemma-4-E2B enables 120–200 tok/s output speed for specific tasks
So far in this project I had been using either Gemini 3 / 2.5 Flash or Flash-Lite. None of my use cases are agentic; they are plain LLM workflows for atomic tasks such as extracting references from legal texts, classifying, converting titles to the nominative case, and so on. All of this happens in a non-English language (LT, Lithuanian), which is one of the reasons I originally picked Google models: their multilingual quality is very good for small base languages. Each single request usually fits in 2k-6k tokens of context. Recently I
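The title's setup pairs a large target model with a small draft model. The core idea of greedy speculative decoding can be sketched as follows; this is a toy illustration with placeholder `target_next` / `draft_next` callables standing in for real model calls (the model names and the helper names are assumptions, not the post's code), and a real implementation would verify all draft tokens in a single batched forward pass of the target rather than one call per token:

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]  # greedy next-token function over a context


def speculative_decode(target_next: NextFn, draft_next: NextFn,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding sketch: the cheap draft model proposes up to k
    tokens; the expensive target verifies them left to right. Accepted tokens are
    kept, the first mismatch is replaced by the target's own token, and the output
    is identical to decoding with the target alone."""
    out = list(prompt)
    produced = 0
    while produced < max_new:
        # 1. Draft proposes up to k tokens autoregressively (cheap calls).
        ctx = list(out)
        proposal = []
        for _ in range(min(k, max_new - produced)):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies the proposal left to right (batched in practice).
        all_accepted = True
        for t in proposal:
            tgt = target_next(out)
            if tgt == t:
                out.append(t)    # draft token accepted
            else:
                out.append(tgt)  # rejected: keep the target's correction token
                all_accepted = False
            produced += 1
            if not all_accepted:
                break
        # 3. If every draft token matched, the target contributes one bonus token.
        if all_accepted and produced < max_new:
            out.append(target_next(out))
            produced += 1
    return out
```

The speedup comes from step 2: when the draft's guesses match, several output tokens are confirmed per target forward pass, which is why a well-matched small draft (like an E2B next to a 31B) can multiply tokens/s on predictable, formulaic tasks.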
Original article on Reddit