How It Started: Hitting the GIL Wall at Scale
We’ve been running model serving in production for many years. When we started building Shepherd Model Gateway, the goal was modest: find out whether cache-aware load balancing could improve routing across inference replicas.
It could. And as we went deeper, we found a much bigger problem.
In both SGLang and vLLM, tokenization and detokenization had be...