Kubernetes was designed for stateless web services. You define a replica count, set CPU and memory requests, configure an HPA, and let the scheduler handle the rest. This model has served us well for a decade. But AI workloads are breaking every assumption it was built on, and SRE teams are scrambling to adapt.
As Komodor's "7 Kubernetes Predictions for 2026" report and The New Stack's AI inference analysis document, we're entering an era where the fundamental SRE practices — capacity planning, autoscaling, resource management, and cost attribution — need to be rethought for GPU-accelerated, memory-intensive, bursty AI workloads.
WHERE TRADITIONAL SRE BREAKS DOWN
1. GPU Scheduling Is Nothing Like CPU Scheduling
CPUs are fungible. Any pod can run on any node with enough CPU millicores. GPUs are not. A training job needs 8x A100s with NVLink interconnect. An inference service needs 1x L4 with at least 24GB VRAM. A fine-tuning job needs 4x H100s in the same pod. The Kubernetes scheduler has no native understanding of GPU topology, memory bandwidth, or interconnect requirements.
# Traditional: Simple resource request
resources:
  requests:
    cpu: "2"
    memory: "4Gi"

# AI workload: Complex GPU requirements
resources:
  requests:
    cpu: "8"
    memory: "64Gi"
  limits:
    nvidia.com/gpu: "4"   # extended resources are set as limits
# Plus: GPU type, VRAM, interconnect, topology —
# none of which K8s handles natively
# Requires: device plugins, topology-aware scheduling,
# custom schedulers
The result: SRE teams are building custom scheduling extensions, using tools like Volcano and Kueue, and fighting with node affinity rules to get GPU workloads placed correctly.
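Node affinity is the bluntest of these instruments, but it illustrates the problem: GPU type has to be encoded as a node label because the scheduler has no first-class concept of it. A minimal sketch follows; the label key is provider-specific (GKE exposes `cloud.google.com/gke-accelerator`, other platforms use different keys), and the image name is a placeholder:

```yaml
# Sketch: pinning an inference pod to a specific GPU type via node affinity.
# The label key is provider-specific; the image is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cloud.google.com/gke-accelerator   # GKE-specific label
                operator: In
                values: ["nvidia-l4"]
  containers:
    - name: server
      image: inference-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: "1"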
2. Autoscaling Doesn't Work for Inference
Traditional HPA watches CPU utilization and scales replicas. AI inference has completely different scaling characteristics:
- Cold start is minutes, not seconds — Loading a 70B parameter model into GPU memory takes 2-5 minutes. By the time a new replica is ready, the traffic burst is over
- GPU utilization is misleading — A GPU can be 90% utilized with batch processing and still have room for more requests with smart batching
- Memory is the bottleneck, not compute — You can't "scale up" GPU memory. If a model doesn't fit, you need a bigger GPU or model parallelism
- Traffic is extremely bursty — AI inference traffic doesn't follow web traffic patterns. A single upstream event can spike requests 100x in seconds
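A partial mitigation is to stop scaling on CPU entirely and scale on a signal that actually tracks inference load, such as queue depth. The sketch below assumes a custom-metrics pipeline (e.g., the Prometheus adapter) exposing a hypothetical `inference_queue_depth` metric per pod:

```yaml
# Sketch: HPA scaling on queue depth instead of CPU utilization.
# Assumes a custom-metrics pipeline; the metric name is hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2          # warm floor: never scale to zero
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "10"            # target queued requests per pod
```

This helps with the "GPU utilization is misleading" problem, but it does nothing about cold starts, which is why predictive approaches (below) matter.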
3. Storage for Models Is a Different Problem
Web services need maybe a few GB of container image. AI models need terabytes:
- Model weights — A single Llama 70B model is ~140GB in FP16. Multiple model versions for A/B testing means hundreds of GB per node
- Model loading time — Pulling 140GB from a container registry on every pod start is a non-starter. Models need to be pre-loaded on nodes or served from high-speed network storage
- Checkpoint storage — Training jobs produce checkpoints that can be 500GB+ each. Managing checkpoint lifecycle is a new SRE responsibility
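One common workaround for model loading time is to keep weights out of the container image entirely and mount them from a pre-populated, read-only shared volume. A minimal sketch, in which the PVC name, mount path, and image are assumptions:

```yaml
# Sketch: serving weights from a shared read-only volume instead of
# baking them into the image. PVC name and paths are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: llama-70b-server
spec:
  containers:
    - name: server
      image: inference-server:latest   # placeholder image
      volumeMounts:
        - name: model-weights
          mountPath: /models
          readOnly: true
  volumes:
    - name: model-weights
      persistentVolumeClaim:
        claimName: llama-70b-weights   # pre-populated, ReadOnlyMany PVC
```

With this pattern, pod startup cost is the model-to-VRAM load, not a 140GB registry pull.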
HOW SRE IS ADAPTING
1. Predictive Scaling Instead of Reactive
Since cold starts make reactive autoscaling useless for inference, SRE teams are moving to predictive scaling. Instead of scaling on current metrics, they scale on predicted demand:
- Time-series forecasting on request patterns
- Upstream event detection (if a marketing email goes out, pre-scale inference endpoints)
- Baseline warm pools that keep minimum GPU capacity ready at all times
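One way to wire up the "pre-scale before known events" and "warm pool" ideas is KEDA's cron scaler. A sketch, assuming KEDA is installed and a `llm-inference` Deployment exists; the schedule and replica counts are illustrative:

```yaml
# Sketch: time-based pre-scaling with KEDA's cron trigger.
# Assumes KEDA is installed; schedule and counts are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-prescale
spec:
  scaleTargetRef:
    name: llm-inference
  minReplicaCount: 2        # baseline warm pool, always ready
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: "45 8 * * 1-5"    # pre-scale before the weekday 9am spike
        end: "0 18 * * 1-5"
        desiredReplicas: "10"
```

The `start` time is deliberately earlier than the expected spike so that model loading finishes before traffic arrives.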
2. Cost Attribution Gets Granular
A single A100 GPU costs $1-3/hour. When multiple teams share a GPU cluster, cost attribution becomes critical. SRE teams are implementing:
- Per-model cost tracking — Which model consumes how much GPU time?
- Request-level cost allocation — How much does a single inference request cost?
- Idle cost penalties — GPUs sitting idle because a team over-requested capacity
# GPU cost attribution per namespace
┌──────────────┬─────────┬──────────┬──────────┐
│ Namespace    │ GPU-hrs │ Cost     │ Idle %   │
├──────────────┼─────────┼──────────┼──────────┤
│ ml-training  │   2,400 │   $4,800 │   12%    │
│ inference-v1 │     720 │   $1,440 │   45% ⚠  │
│ fine-tuning  │     168 │     $336 │    8%    │
│ experiments  │      96 │     $192 │   67% 🔴 │
└──────────────┴─────────┴──────────┴──────────┘
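The raw data behind a table like this typically comes from NVIDIA's dcgm-exporter, rolled up by namespace with Prometheus recording rules. A sketch, assuming the exporter is deployed with Kubernetes pod/namespace label mapping enabled:

```yaml
# Sketch: Prometheus recording rules aggregating dcgm-exporter metrics
# per namespace. Assumes pod/namespace label mapping is enabled.
groups:
  - name: gpu-cost-attribution
    rules:
      # Average GPU utilization across all GPUs a namespace holds
      - record: namespace:gpu_utilization:avg
        expr: avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)
      # Number of GPUs currently allocated to pods in the namespace
      - record: namespace:gpu_allocated:count
        expr: count by (namespace) (DCGM_FI_DEV_GPU_UTIL)
```

Multiplying allocated GPU-hours by the hourly rate gives the cost column; low utilization against high allocation surfaces the idle-cost offenders.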
3. Model Serving as a Platform Service
Instead of every team deploying their own inference infrastructure, SRE teams are building centralized model serving platforms. Tools like vLLM, Triton Inference Server, and TGI provide optimized inference with features that individual teams shouldn't have to build:
- Continuous batching — Dynamically group requests to maximize GPU utilization
- KV cache management — Efficient memory management for transformer attention caches
- Model routing — Route requests to the right model version based on headers or content
- Graceful degradation — Fall back to smaller models under load instead of returning errors
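As a platform building block, a vLLM-based serving deployment can be as simple as the sketch below. The model name, replica count, and parallelism settings are illustrative, and real deployments also need weight storage (see above) and health probes:

```yaml
# Sketch: a vLLM serving deployment as a platform building block.
# Model name, replicas, and parallelism are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size"   # shard the model across 4 GPUs
            - "4"
          ports:
            - containerPort: 8000        # OpenAI-compatible API
          resources:
            limits:
              nvidia.com/gpu: "4"
```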
4. Observability for GPU Workloads
Traditional metrics (CPU, memory, network) don't capture what matters for GPU workloads. SRE teams need new signals:
- GPU utilization per SM — Not just overall GPU %, but streaming multiprocessor-level utilization
- VRAM usage and fragmentation — Memory pressure that causes OOM kills
- Inference latency distributions — P50 might be fine while P99 is unacceptable due to batching
- Tokens per second — The fundamental throughput metric for LLM inference
- Time to first token (TTFT) — User-perceived latency for streaming responses
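These signals only matter if someone is paged on them. A sketch of a TTFT alert using the Prometheus Operator; the metric name follows vLLM's exporter naming, but verify it against whatever your serving stack actually exposes, and the 2-second threshold is an assumption:

```yaml
# Sketch: alerting on p99 time-to-first-token. Metric name assumes
# vLLM's exporter; the threshold is an illustrative assumption.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-latency
spec:
  groups:
    - name: ttft
      rules:
        - alert: HighTTFTP99
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
            ) > 2
          for: 5m
          labels:
            severity: page
```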
THE PATH FORWARD
Kubernetes isn't going away — it's evolving. Projects like Kueue for job queuing, Volcano for batch scheduling, and the DRA (Dynamic Resource Allocation) KEP are adding the primitives that AI workloads need. But the SRE discipline needs to evolve too.
The SRE of 2026 needs to understand GPU topology, model serving patterns, and ML training lifecycle — not just uptime and latency. The teams that adapt their SRE practices to AI workloads will be the ones running reliable, cost-effective AI in production. The teams that try to force AI workloads into web-service-shaped SRE practices will fight a losing battle against GPU scheduling, cold starts, and cost overruns.