SMG: The Case for Disaggregating CPU from GPU in LLM Serving

How It Started: Hitting the GIL Wall at Scale We’ve been running production model serving for many years. When we first started building Shepherd Model Gateway, the goal was modest: figure out if cache-aware load balancing could improve routing across inference replicas. It could. And as we went deeper, we found a much bigger problem. In both SGLang and vLLM, tokenization and detokenization had be...

SMG: The Case for Disaggregating CPU from GPU in LLM Serving

Facts Only

Executive Summary

Full Take