Batch inference on OpenShift AI with llm-d: Architecture, integration, and workflows

Most public discussion of LLMs centers on interactive inference (chatbots, coding assistants, agents), where latency is the headline metric, but production traffic is broader. Teams also run heavy-duty tasks: model evaluations, dataset scoring, massive embedding jobs for retrieval-augmented generation (RAG), and backfills when policies or model versions change. These workloads are deadline-driven, not latency-s...