Red Hat’s contribution of llm-d to the CNCF Sandbox makes Kubernetes a first-class platform for LLM inference, not just a “good enough” runtime.[1]
By treating accelerators, topology, and KV cache as programmable resources, llm-d turns existing Kubernetes clusters into shared AI fabrics instead of isolated inference stacks.[4][7]
💡 Key idea: llm-d makes LLM inference a cloud native workload governed by open standards and CNCF processes, not vendor-specific systems.[1]
1. Why llm-d Matters for Kubernetes and CNCF
llm-d’s CNCF Sandbox status anchors LLM inference in neutral, open governance similar to Kubernetes itself.[1]
- Ensures APIs, patterns, and scheduling semantics evolve under Linux Foundation stewardship.
- Reduces lock-in risk versus proprietary inference platforms.
The project’s origins reflect broad industry backing and vendor neutrality:
- Launched in May 2025 by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA.[1]
- Expanded to AMD, Cisco, Hugging Face, Intel, Lambda, Mistral AI, and universities.[1][10]
- Signals alignment on a shared, Kubernetes-native inference approach.
⚡ Strategic shift: Designed for “any model, any accelerator, any cloud,” targeting heterogeneous, multi-cloud clusters with GPUs, TPUs, and custom ASICs.[1][3]
llm-d is:
- A vehicle to evolve Kubernetes into state-of-the-art AI infrastructure.[1]
- Focused on production serving: performance per dollar, multi-tenancy, and SLOs.[7][9]
- Aimed at platform/DevOps teams, not just researchers.
💼 Section takeaway: With llm-d in CNCF, Kubernetes becomes the default place to standardize LLM serving, scheduling, and optimization across vendors and clouds.
2. Core Architecture: Distributed Inference Built for Kubernetes
llm-d provides a Kubernetes-native architecture for distributed inference, built on vLLM plus an inference scheduler, cache-aware routing, and disaggregated serving.[2][7] It embeds into Kubernetes rather than replacing it.
Disaggregated prefill and decode
Inference is split into two phases:
- Prefill: Compute-heavy, builds KV cache for input tokens.
- Decode: Memory-bandwidth-bound, consumes KV cache to generate tokens.[8]
llm-d can run these on different replicas and accelerator types, so GPUs are used where they matter instead of over-provisioning every pod.[3][8]
```mermaid
flowchart LR
    A[Client Request] --> B[Inference Gateway]
    B --> C[Prefill Servers]
    C --> D[KV Cache Store]
    D --> E[Decode Servers]
    E --> F[Response Stream]
    style C fill:#22c55e,color:#fff
    style E fill:#0ea5e9,color:#fff
```
📊 Architecture insight: Disaggregation replaces “one big GPU per pod” with a tunable pipeline per phase, workload, and accelerator.[3][8]
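The split can be illustrated with a toy pipeline, plain Python rather than llm-d's actual implementation: prefill builds a KV cache entry per input token, and decode consumes that cache to emit output tokens, so the two phases only communicate through the cache store.

```python
# Toy sketch of disaggregated serving: prefill and decode as separate
# stages that share state only through a KV cache. This illustrates the
# control flow, not llm-d's real implementation.

def prefill(prompt_tokens):
    """Compute-heavy phase: build one KV cache entry per input token."""
    return [f"kv({tok})" for tok in prompt_tokens]  # stand-in for tensors

def decode(kv_cache, max_new_tokens):
    """Memory-bandwidth-bound phase: consume the cache to generate tokens."""
    output = []
    for i in range(max_new_tokens):
        # A real decoder attends over kv_cache; we just record context size.
        output.append(f"tok{i}<ctx={len(kv_cache) + i}>")
    return output

# In llm-d, prefill and decode can run on different replicas and even
# different accelerator types; here they are simply separate functions.
cache = prefill(["The", "quick", "brown"])
generated = decode(cache, 2)
```

Because the only interface between the two functions is the cache, either side can be scaled or placed independently, which is the property disaggregation exploits.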
Integration with Inference Gateway
llm-d integrates with the Kubernetes Inference Gateway (IGW):[2][7]
- Applications call a stable gateway API.
- Platform teams optimize routing, placement, and scaling internally.
- Models, policies, and accelerator layouts can change without touching app code.
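From the application side, this contract can be as simple as an OpenAI-compatible HTTP call; the URL, model name, and schema below are assumptions for illustration, so check your gateway's actual endpoint:

```python
# Sketch of an application calling a stable inference gateway endpoint.
# The URL, model name, and OpenAI-compatible schema are illustrative
# assumptions, not a documented llm-d API.
import json
from urllib import request

GATEWAY_URL = "http://inference-gateway.example/v1/chat/completions"

def build_request(model, user_prompt):
    """The app only knows the gateway contract, not the backend layout."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": 128,
    }

payload = build_request("llama-3-8b", "Summarize the CNCF Sandbox process.")
req = request.Request(
    GATEWAY_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment against a live gateway
```

The point is that routing, placement, and accelerator layout can all change behind `GATEWAY_URL` without this code being touched.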
Topology-aware scheduling
The scheduler understands:
- GPU peer-to-peer connectivity
- NUMA layout and local memory bandwidth
- Network fabrics and cross-node bandwidth[3][10]
Using this topology, llm-d:
- Routes requests to meet latency SLOs at lowest cost.
- Avoids naive balancing by CPU or generic utilization.[3][10]
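A simplified scoring loop conveys the idea; the node attributes and weights here are hypothetical, not llm-d's actual scheduler inputs:

```python
# Toy topology-aware placement: score candidate nodes on attributes a
# scheduler like llm-d's might consider (free accelerator memory, fabric
# bandwidth, GPU peer-to-peer support) instead of raw CPU utilization.
# Attribute names and weights are illustrative only.

def score(node, needs_p2p):
    s = 0.0
    s += 2.0 * node["gpu_mem_free_gb"] / 80     # free accelerator memory
    s += 1.5 * node["interconnect_gbps"] / 900  # NVLink / fabric bandwidth
    if needs_p2p and not node["gpu_p2p"]:
        s -= 10.0                               # penalize missing peer-to-peer
    return s

def place(nodes, needs_p2p=True):
    """Pick the best-scoring node for a communication-heavy workload."""
    return max(nodes, key=lambda n: score(n, needs_p2p))["name"]

nodes = [
    {"name": "gpu-a", "gpu_mem_free_gb": 40, "interconnect_gbps": 900, "gpu_p2p": True},
    {"name": "gpu-b", "gpu_mem_free_gb": 60, "interconnect_gbps": 100, "gpu_p2p": False},
]
best = place(nodes)  # gpu-a: p2p plus fast fabric outweigh raw free memory
```

Note how a naive "most free memory" policy would pick `gpu-b`, while the topology-aware score prefers `gpu-a`.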
Guides and Helm recipes provide “well-lit paths” for deploying llm-d across tens or hundreds of nodes, single- or multi-tenant.[9]
⚠️ Section takeaway: llm-d makes inference architecture a native Kubernetes concern, combining vLLM, IGW, and topology-aware scheduling into a reproducible stack.
3. Performance and Cost Optimizations for Enterprise LLMs
llm-d focuses on levers that determine whether LLMs are economically viable at scale.
KV cache aware routing
KV cache aware routing sends follow-up or similar prompts to cache-warm nodes, avoiding repeated prefill work.[2][7]
- Especially valuable for multi-step prompts, agents, and RAG.
- Reduces tail latency and jitter.
```mermaid
flowchart LR
    A[New Prompt] --> B{Cache Hit?}
    B -- Yes --> C[Route to Warm Node]
    B -- No --> D[Route to Any Node]
    C --> E[Low Latency Response]
    D --> F[Prefill + Decode]
    style C fill:#22c55e,color:#fff
    style F fill:#f59e0b,color:#fff
```
📊 Practical effect: Users see better latency from cache-warm routing and higher GPU utilization by assigning accelerators to specific pipeline stages instead of cloning full stacks per replica.[2]
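The routing decision above can be sketched as a prefix-keyed lookup. This is a deliberate simplification: real implementations track cache state per block or per prefix hash, but the hit/miss logic is the same shape.

```python
# Toy KV-cache-aware router: prompts whose shared prefix is already
# cached on some node go back to that node; misses fall back to the
# least-loaded node. The fixed-prefix fingerprint is a simplification.
import hashlib

class CacheAwareRouter:
    def __init__(self, nodes):
        self.load = {n: 0 for n in nodes}
        self.warm = {}  # prefix fingerprint -> node holding that KV cache

    def _fingerprint(self, prompt, prefix_words=5):
        prefix = " ".join(prompt.split()[:prefix_words])
        return hashlib.sha256(prefix.encode()).hexdigest()

    def route(self, prompt):
        key = self._fingerprint(prompt)
        node = self.warm.get(key)                     # cache hit: reuse warm node
        if node is None:
            node = min(self.load, key=self.load.get)  # miss: least loaded
            self.warm[key] = node                     # that node is now warm
        self.load[node] += 1
        return node

router = CacheAwareRouter(["pod-a", "pod-b"])
first = router.route("You are a helpful agent. Step 1: plan.")
follow = router.route("You are a helpful agent. Step 2: execute.")
# follow lands on the same pod as first: the shared system-prompt prefix
# means its KV cache is already warm there, so prefill is skipped.
```

This is exactly the multi-step agent and RAG pattern the section describes: every follow-up request reuses the expensive prefill work of the first.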
Disaggregated serving and workload-aware scheduling
Separating prefill and decode lets llm-d:
- Reduce duplicate model state replication.[2][8]
- Assign hardware by workload shape (short chat, long-context, large batch).[3][8]
- Improve latency and accelerator utilization for a fixed hardware budget.[2][8]
llm-d is tuned for high-value enterprise patterns such as multi-turn agents, RAG pipelines, and long-context reasoning, all of which stress cache management and scheduling.[2][7]
Vendors like Mistral AI note that next-gen models (e.g., Mixture of Experts) require robust KV cache management and disaggregated serving—exactly llm-d’s focus.[1]
💡 Section takeaway: llm-d exposes cache locality and phase-aware scheduling as explicit controls, turning raw accelerator capacity into better latency and lower cost for real workloads.
4. Multi-Accelerator and Topology-Aware Inference
The same mechanisms also let llm-d treat heterogeneous hardware as one programmable pool. Modern clusters often mix:
- High-end GPUs for interactive chat
- Memory-rich accelerators for long-context reasoning
- Custom ASICs/TPUs for batch or offline jobs[3]
llm-d offers:
- A unified recipe and scheduler that understands accelerator classes.
- Hardware selection based on workload pattern, not manual guesswork.[3]
Topology and interconnect awareness
llm-d surfaces interconnect details—from NUMA layouts to network fabrics and GPU peer-to-peer bandwidth—so communication-heavy workloads land where overhead is minimized.[3][10]
Expressed via Kubernetes primitives:
- Node labels/taints for accelerator type and topology
- Affinity/anti-affinity and scheduling constraints
- Standard observability for monitoring hot paths[3][9]
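For example, steering decode replicas onto nodes with a particular accelerator class could use ordinary node affinity. The label key and value below are hypothetical; use whatever labeling convention your cluster already follows:

```yaml
# Hypothetical Pod spec fragment: pin a decode replica to nodes labeled
# with a memory-bandwidth-optimized GPU class. Label keys are examples,
# not llm-d conventions.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator.example.com/class
              operator: In
              values: ["hbm-gpu"]
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

Because these are standard Kubernetes primitives, the same spec works unchanged across clouds and hardware vendors.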
```mermaid
flowchart TB
    A[Workload Type] --> B{Chat}
    A --> C{Long Context}
    A --> D{Batch}
    B --> E[Low-latency GPUs]
    C --> F[High-memory Nodes]
    D --> G[Cost-optimized ASICs]
    style E fill:#22c55e,color:#fff
    style F fill:#0ea5e9,color:#fff
    style G fill:#f59e0b,color:#fff
```
📊 Planning aid: Platform teams get a practical scorecard for mixing accelerators by workload—chat, long-context, batch—rather than guessing hardware purchases and placement.[3]
This multi-accelerator strategy aligns with industry trends: GPU and CPU vendors back llm-d so their hardware participates in a standardized, open inference stack.[1][10]
⚡ Section takeaway: llm-d turns heterogeneous hardware and complex topology into declarative scheduling inputs, enabling portable, vendor-neutral AI fabrics.
5. Adoption Path: From First Cluster to Production Platform
llm-d pairs advanced capabilities with a realistic adoption path.
From quickstart to optimized platforms
Official guides and Helm charts provide:
- Tested, benchmarked recipes for high-performance deployments.[9]
- A low entry bar: only basic familiarity with Kubernetes is required.[9]
- Coverage for two target shapes:
  - Single-model deployments across tens or hundreds of nodes
  - Multi-tenant model-as-a-service platforms that share deployments[9]
The “well-lit path” includes curated configs for:
- Intelligent inference scheduling
- Prefill/decode disaggregation
- KV cache aware routing[9]
```mermaid
flowchart LR
    A[Quickstart Cluster] --> B[Intelligent Scheduling]
    B --> C[Prefill/Decode Split]
    C --> D[KV Cache Tests]
    D --> E[Multi-tenant Platform]
    style A fill:#e5e7eb
    style E fill:#22c55e,color:#fff
```
Red Hat’s guidance helps teams:
- Validate KV cache aware routing.
- Measure latency and cost improvements against their own workloads.[7][8]
Community-driven evolution
Cloud Native FM discussions with Red Hat engineers frame llm-d as:
- A practical toolset that strengthens Kubernetes for enterprise LLM inference, not a silver bullet.[2]
- A CNCF Sandbox project inviting contributions from operators, vendors, and researchers.[1][7]
This ensures llm-d tracks rapid shifts in:
- Model architectures
- Accelerator types
- Workload patterns
💼 Section takeaway: With opinionated docs, Helm recipes, and open governance, llm-d offers a low-friction path from first experiment to production-grade, multi-tenant LLM platforms.
Conclusion: Turning Kubernetes into an AI Fabric
By contributing llm-d to CNCF, Red Hat and partners are defining a Kubernetes-native, vendor-neutral standard for distributed LLM inference across accelerators, topologies, and clouds.[1][3][7]
Platform teams can manage GPUs, KV caches, and cluster fabric as programmable resources within the same ecosystem that standardized containers and microservices.
⚡ Call to action:
Platform teams should:
- Pilot llm-d using official guides and Helm recipes.[9]
- Benchmark KV cache aware routing and disaggregated serving against current stacks.[8]
- Engage with the CNCF llm-d community to influence features and roadmap as generative AI evolves.[2]
Early adopters will help shape—and benefit from—the next generation of cloud native AI infrastructure.
Sources & References (6)
1. Welcome llm-d to the CNCF: Evolving Kubernetes into SOTA AI infrastructure (CNCF blog, March 24, 2026; Carlos Costa, IBM Research; Clayton Coleman, Google; Rob Shaw, Red Hat)
2. How to Run LLMs on Kubernetes with llm-d: A Distributed Inference Stack (Cloud Native FM, Saim Safder)
3. llm-d: Multi-Accelerator LLM Inference on Kubernetes (Erwan Gallen, Red Hat)
4. Getting started with llm-d for distributed AI inference (Red Hat, August 19, 2025; Cedric Clyburn, Philip Hayes)
5. Guides | llm-d (official llm-d documentation)
6. Deploying llm-d in Kubernetes: The Future of Distributed AI Inference at Scale