Kubernetes was designed for stateless web services. You define a replica count, set CPU and memory requests, configure an HPA, and let the scheduler handle the rest. This model has served us well for a decade. But AI workloads are breaking every assumption it was built on, and SRE teams are scrambling to adapt.
As Komodor's "7 Kubernetes Predictions for 2026" report and The New Stack's AI inference analysis document, we're entering an era where the fundamental SRE practices — capacity planning, autoscaling, resource management, and cost attribution — need to be rethought for GPU-accelerated, memory-intensive, bursty AI workloads.
WHERE TRADITIONAL SRE BREAKS DOWN
1. GPU Scheduling Is Nothing Like CPU Scheduling
CPUs are fungible. Any pod can run on any node with enough CPU millicores. GPUs are not. A training job needs 8x A100s with NVLink interconnect. An inference service needs 1x L4 with at least 24GB VRAM. A fine-tuning job needs 4x H100s in the same pod. The Kubernetes scheduler has no native understanding of GPU topology, memory bandwidth, or interconnect requirements.
# Traditional: Simple resource request
resources:
  requests:
    cpu: "2"
    memory: "4Gi"

# AI workload: Complex GPU requirements
resources:
  requests:
    cpu: "8"
    memory: "64Gi"
  limits:
    nvidia.com/gpu: "4"   # extended resources are set as limits
# Plus: GPU type, VRAM, interconnect, topology —
# none of which K8s handles natively
# Requires: device plugins, topology-aware scheduling,
# custom schedulers
The result: SRE teams are building custom scheduling extensions, using tools like Volcano and Kueue, and fighting with node affinity rules to get GPU workloads placed correctly.
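Node affinity is the bluntest of these instruments, but it illustrates the problem: GPU type has to be encoded as a node label because the scheduler has no first-class concept of it. A minimal sketch follows; the label key is provider-specific (GKE exposes `cloud.google.com/gke-accelerator`, other platforms use different keys), and the image name is a placeholder:

```yaml
# Sketch: pinning an inference pod to a specific GPU type via node affinity.
# The label key is provider-specific; the image is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cloud.google.com/gke-accelerator   # GKE-specific label
                operator: In
                values: ["nvidia-l4"]
  containers:
    - name: server
      image: inference-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: "1"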
2. Autoscaling Doesn't Work for Inference
Traditional HPA watches CPU utilization and scales replicas. AI inference has completely different scaling characteristics:
- Cold start is minutes, not seconds — Loading a 70B parameter model into GPU memory takes 2-5 minutes. By the time a new replica is ready, the traffic burst is over
- GPU utilization is misleading — A GPU can be 90% utilized with batch processing and still have room for more requests with smart batching
- Memory is the bottleneck, not compute — You can't "scale up" GPU memory. If a model doesn't fit, you need a bigger GPU or model parallelism
- Traffic is extremely bursty — AI inference traffic doesn't follow web traffic patterns. A single upstream event can spike requests 100x in seconds
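A partial mitigation is to stop scaling on CPU entirely and scale on a signal that actually tracks inference load, such as queue depth. The sketch below assumes a custom-metrics pipeline (e.g., the Prometheus adapter) exposing a hypothetical `inference_queue_depth` metric per pod:

```yaml
# Sketch: HPA scaling on queue depth instead of CPU utilization.
# Assumes a custom-metrics pipeline; the metric name is hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2          # warm floor: never scale to zero
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "10"            # target queued requests per pod
```

This helps with the "GPU utilization is misleading" problem, but it does nothing about cold starts, which is why predictive approaches (below) matter.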
3. Storage for Models Is a Different Problem
Web services need maybe a few GB of container image. AI models need terabytes:
- Model weights — A single Llama 70B model is ~140GB in FP16. Multiple model versions for A/B testing means hundreds of GB per node
- Model loading time — Pulling 140GB from a container registry on every pod start is a non-starter. Models need to be pre-loaded on nodes or served from high-speed network storage
- Checkpoint storage — Training jobs produce checkpoints that can be 500GB+ each. Managing checkpoint lifecycle is a new SRE responsibility
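One common workaround for model loading time is to keep weights out of the container image entirely and mount them from a pre-populated, read-only shared volume. A minimal sketch, in which the PVC name, mount path, and image are assumptions:

```yaml
# Sketch: serving weights from a shared read-only volume instead of
# baking them into the image. PVC name and paths are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: llama-70b-server
spec:
  containers:
    - name: server
      image: inference-server:latest   # placeholder image
      volumeMounts:
        - name: model-weights
          mountPath: /models
          readOnly: true
  volumes:
    - name: model-weights
      persistentVolumeClaim:
        claimName: llama-70b-weights   # pre-populated, ReadOnlyMany PVC
```

With this pattern, pod startup cost is the model-to-VRAM load, not a 140GB registry pull.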
HOW SRE IS ADAPTING
1. Predictive Scaling Instead of Reactive
Since cold starts make reactive autoscaling useless for inference, SRE teams are moving to predictive scaling. Instead of scaling on current metrics, they scale on predicted demand:
- Time-series forecasting on request patterns
- Upstream event detection (if a marketing email goes out, pre-scale inference endpoints)
- Baseline warm pools that keep minimum GPU capacity ready at all times
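One way to wire up the "pre-scale before known events" and "warm pool" ideas is KEDA's cron scaler. A sketch, assuming KEDA is installed and a `llm-inference` Deployment exists; the schedule and replica counts are illustrative:

```yaml
# Sketch: time-based pre-scaling with KEDA's cron trigger.
# Assumes KEDA is installed; schedule and counts are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-prescale
spec:
  scaleTargetRef:
    name: llm-inference
  minReplicaCount: 2        # baseline warm pool, always ready
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: "45 8 * * 1-5"    # pre-scale before the weekday 9am spike
        end: "0 18 * * 1-5"
        desiredReplicas: "10"
```

The `start` time is deliberately earlier than the expected spike so that model loading finishes before traffic arrives.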
2. Cost Attribution Gets Granular
A single A100 GPU costs $1-3/hour. When multiple teams share a GPU cluster, cost attribution becomes critical. SRE teams are implementing:
- Per-model cost tracking — Which model consumes how much GPU time?
- Request-level cost allocation — How much does a single inference request cost?
- Idle cost penalties — GPUs sitting idle because a team over-requested capacity
# GPU cost attribution per namespace
┌──────────────┬─────────┬──────────┬──────────┐
│ Namespace    │ GPU-hrs │ Cost     │ Idle %   │
├──────────────┼─────────┼──────────┼──────────┤
│ ml-training  │   2,400 │   $4,800 │   12%    │
│ inference-v1 │     720 │   $1,440 │   45% ⚠  │
│ fine-tuning  │     168 │     $336 │    8%    │
│ experiments  │      96 │     $192 │   67% 🔴 │
└──────────────┴─────────┴──────────┴──────────┘
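The raw data behind a table like this typically comes from NVIDIA's dcgm-exporter, rolled up by namespace with Prometheus recording rules. A sketch, assuming the exporter is deployed with Kubernetes pod/namespace label mapping enabled:

```yaml
# Sketch: Prometheus recording rules aggregating dcgm-exporter metrics
# per namespace. Assumes pod/namespace label mapping is enabled.
groups:
  - name: gpu-cost-attribution
    rules:
      # Average GPU utilization across all GPUs a namespace holds
      - record: namespace:gpu_utilization:avg
        expr: avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)
      # Number of GPUs currently allocated to pods in the namespace
      - record: namespace:gpu_allocated:count
        expr: count by (namespace) (DCGM_FI_DEV_GPU_UTIL)
```

Multiplying allocated GPU-hours by the hourly rate gives the cost column; low utilization against high allocation surfaces the idle-cost offenders.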
3. Model Serving as a Platform Service
Instead of every team deploying their own inference infrastructure, SRE teams are building centralized model serving platforms. Tools like vLLM, Triton Inference Server, and TGI provide optimized inference with features that individual teams shouldn't have to build:
- Continuous batching — Dynamically group requests to maximize GPU utilization
- KV cache management — Efficient memory management for transformer attention caches
- Model routing — Route requests to the right model version based on headers or content
- Graceful degradation — Fall back to smaller models under load instead of returning errors
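As a platform building block, a vLLM-based serving deployment can be as simple as the sketch below. The model name, replica count, and parallelism settings are illustrative, and real deployments also need weight storage (see above) and health probes:

```yaml
# Sketch: a vLLM serving deployment as a platform building block.
# Model name, replicas, and parallelism are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size"   # shard the model across 4 GPUs
            - "4"
          ports:
            - containerPort: 8000        # OpenAI-compatible API
          resources:
            limits:
              nvidia.com/gpu: "4"
```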
4. Observability for GPU Workloads
Traditional metrics (CPU, memory, network) don't capture what matters for GPU workloads. SRE teams need new signals:
- GPU utilization per SM — Not just overall GPU %, but streaming multiprocessor-level utilization
- VRAM usage and fragmentation — Memory pressure that causes OOM kills
- Inference latency distributions — P50 might be fine while P99 is unacceptable due to batching
- Tokens per second — The fundamental throughput metric for LLM inference
- Time to first token (TTFT) — User-perceived latency for streaming responses
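These signals only matter if someone is paged on them. A sketch of a TTFT alert using the Prometheus Operator; the metric name follows vLLM's exporter naming, but verify it against whatever your serving stack actually exposes, and the 2-second threshold is an assumption:

```yaml
# Sketch: alerting on p99 time-to-first-token. Metric name assumes
# vLLM's exporter; the threshold is an illustrative assumption.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-latency
spec:
  groups:
    - name: ttft
      rules:
        - alert: HighTTFTP99
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
            ) > 2
          for: 5m
          labels:
            severity: page
```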
THE PATH FORWARD
Kubernetes isn't going away — it's evolving. Projects like Kueue for job queuing, Volcano for batch scheduling, and the DRA (Dynamic Resource Allocation) KEP are adding the primitives that AI workloads need. But the SRE discipline needs to evolve too.
The SRE of 2026 needs to understand GPU topology, model serving patterns, and ML training lifecycle — not just uptime and latency. The teams that adapt their SRE practices to AI workloads will be the ones running reliable, cost-effective AI in production. The teams that try to force AI workloads into web-service-shaped SRE practices will fight a losing battle against GPU scheduling, cold starts, and cost overruns.