
AI Blueprint

Purpose: For platform engineers, explains how the AI blueprint extends the platform foundation with GPU infrastructure patterns, policy controls, and consistent operations for AI/ML workloads.

Overview

AI workloads do not need a separate platform. They need GPUs, policy controls, and an ops model that does not fall apart when you scale past the demo. This blueprint provides infrastructure patterns designed for AI workloads — not generic compute with a GPU driver bolted on.

Status: Preview

What You Get

  1. GPU-Ready From the Start: NVIDIA GPU Operator, node labeling, taints, and scheduling constraints are configured for GPU workloads.
  2. Policy Before Production: Security and compliance controls ship with the cluster rather than being added after the first audit finding.
  3. Same Ops, Every Environment: Dev, staging, and production use the same workflows, the same tooling, and the same confidence level.

Capabilities

GPU Infrastructure

Component               Purpose
NVIDIA GPU Operator     Automated GPU driver and runtime management
Node labeling           GPU nodes labeled for targeted scheduling
Taints/tolerations      GPU nodes reserved for GPU workloads
Resource quotas         Prevent GPU hoarding across teams/namespaces
Scheduling constraints  Topology-aware placement for multi-GPU workloads
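
As a rough illustration of how the components in this table meet on the workload side, the sketch below pairs a per-namespace GPU quota with a Pod that tolerates a GPU taint and requests one GPU. The namespace `team-a`, the taint key and effect, the node label `example.com/accelerator`, the quota value, and the image are assumptions made for illustration and are not taken from the blueprint. `nvidia.com/gpu` is the extended resource name advertised by the NVIDIA device plugin.

```yaml
# Hypothetical per-team quota: caps the total GPUs requested in one namespace.
# Extended resources are quota'd with the "requests." prefix only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                      # assumed namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"         # assumed limit
---
# Pod that targets labeled GPU nodes, tolerates the GPU taint, and asks for one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
  namespace: team-a
spec:
  restartPolicy: Never
  nodeSelector:
    example.com/accelerator: nvidia      # assumed node label
  tolerations:
    - key: nvidia.com/gpu                # assumed taint key
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: registry.example.com/cuda/smoke-test:1.0   # placeholder, assumed to be a CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1              # GPUs are requested through limits
```

Pods without the toleration stay off tainted GPU nodes, and pods that request more GPUs than the namespace quota allows are rejected at admission.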

Security and Compliance

All platform foundation security applies unchanged:

  • Kyverno policies enforce image sources, resource limits, and security contexts
  • Pod Security Admission prevents privilege escalation
  • RBAC Manager controls namespace access per team
  • SOPS encryption for model credentials and API keys
  • Harbor scans GPU workload images for vulnerabilities
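
To make the image-source control concrete, here is a minimal sketch of a Kyverno policy that only admits Pods whose images come from one registry. The registry host `harbor.example.com` and the policy and rule names are hypothetical, and the policies shipped with the platform foundation may look different.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries        # hypothetical name
spec:
  # Newer Kyverno releases prefer a per-rule failureAction under validate;
  # check the version in use before copying this field.
  validationFailureAction: Enforce
  background: true
  rules:
    - name: allow-only-internal-harbor   # hypothetical name
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must be pulled from harbor.example.com."
        pattern:
          spec:
            # Covers regular containers only; init and ephemeral containers
            # would need their own pattern entries.
            containers:
              - image: "harbor.example.com/*"
```

Because GPU workloads are ordinary Pods, a policy of this shape applies to them with no GPU-specific handling.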

Operational Consistency

  • Same GitOps workflow (FluxCD) for GPU workloads as any other service
  • Prometheus metrics for GPU utilization, memory, and temperature
  • Loki for training job logs
  • Alertmanager for GPU health alerts
  • Velero backup includes GPU workload state
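
A GPU health alert of the kind listed above might look like the following sketch. It assumes that the Prometheus Operator CRDs are installed, that GPU metrics come from NVIDIA's DCGM exporter (which exposes `DCGM_FI_DEV_GPU_TEMP`), and that 85 degrees Celsius is a sensible threshold. The rule name, namespace, threshold, and duration are illustrative and not part of the blueprint.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-health                # hypothetical name
  namespace: monitoring           # assumed namespace
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUHighTemperature
          # DCGM exporter reports GPU temperature in degrees Celsius.
          expr: DCGM_FI_DEV_GPU_TEMP > 85     # assumed threshold
          for: 5m                             # must hold for 5 minutes before firing
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} has been above 85 C for 5 minutes"
```

Depending on how Prometheus is configured, the rule may also need labels that match the Prometheus `ruleSelector` before it is picked up.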

Current Status

The AI blueprint is in Preview. Available today:

  • GPU Operator deployment via gitops-base
  • Node labeling and taint configuration via Kubespray inventory
  • Scheduling constraints and resource quotas
  • Full observability and security from platform foundation
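
For the Kubespray item, node labels and taints can be set per host with Kubespray's `node_labels` (a dict) and `node_taints` (a list of `key=value:effect` strings) variables. The fragment below is a sketch under assumed names: the host name, address, label, and taint are placeholders, and the rest of the inventory (control plane, etcd, other groups) is omitted.

```yaml
# Fragment of a Kubespray YAML inventory; host name and address are placeholders.
all:
  hosts:
    gpu-node-1:
      ansible_host: 10.0.0.21
      node_labels:
        example.com/accelerator: nvidia          # assumed label for targeted scheduling
      node_taints:
        - "nvidia.com/gpu=present:NoSchedule"    # assumed taint reserving the node for GPU workloads
  children:
    kube_node:
      hosts:
        gpu-node-1:
```

Workloads then need a matching toleration, and usually a node selector or affinity rule, to land on these nodes.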

Planned additions (no committed timeline):

  • Kyverno policies specific to AI workloads (GPU resource validation)
  • Grafana dashboards for GPU cluster utilization
  • Air-gap packaging for GPU Operator and driver images

Composition

Further Reading