AI Blueprint
Purpose: Explains to platform engineers how the AI blueprint extends the platform foundation with GPU infrastructure patterns, policy controls, and consistent operations for AI/ML workloads.
Overview
AI workloads do not need a separate platform. They need GPUs, policy controls, and an ops model that does not fall apart when you scale past the demo. This blueprint provides infrastructure patterns designed for AI workloads — not generic compute with a GPU driver bolted on.
Status: Preview
What You Get
- GPU-Ready From the Start: NVIDIA GPU Operator, node labeling, taints, and scheduling constraints configured for GPU workloads.
- Policy Before Production: Security and compliance controls ship with the cluster rather than being added after the first audit finding.
- Same Ops, Every Environment: Dev, staging, and production use the same workflows and the same tooling, with the same level of confidence.
Capabilities
GPU Infrastructure
| Component | Purpose |
|---|---|
| NVIDIA GPU Operator | Automated GPU driver and runtime management |
| Node labeling | GPU nodes labeled for targeted scheduling |
| Taints/tolerations | GPU nodes reserved for GPU workloads |
| Resource quotas | Prevent GPU hoarding across teams/namespaces |
| Scheduling constraints | Topology-aware placement for multi-GPU workloads |
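
At the workload level, the components in the table above combine roughly as follows. This is a minimal sketch in Python that builds the manifests as plain dicts. The node label `nvidia.com/gpu.present`, the taint key `nvidia.com/gpu`, and every name, namespace, image, and count are assumptions for illustration, not values confirmed by this blueprint. Topology-aware placement is not shown.

```python
import json


def gpu_pod_manifest(name, namespace, image, gpus=1):
    """Build a Pod manifest that requests GPUs and tolerates a GPU node taint."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumed label: adjust to whatever label the cluster puts on GPU nodes.
            "nodeSelector": {"nvidia.com/gpu.present": "true"},
            # Assumed taint key and effect: match the taint set in the inventory.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # GPUs are an extended resource and are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


def gpu_quota_manifest(namespace, max_gpus):
    """Build a ResourceQuota that caps total GPU requests in one namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # Hypothetical quota name.
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        # Quota on extended resources uses the "requests." prefix.
        "spec": {"hard": {"requests.nvidia.com/gpu": max_gpus}},
    }


if __name__ == "__main__":
    # Hypothetical team namespace and image reference.
    pod = gpu_pod_manifest("train-job", "team-a", "harbor.example.com/ml/trainer:1.0")
    quota = gpu_quota_manifest("team-a", 4)
    print(json.dumps([pod, quota], indent=2))
```

The printed JSON is also valid YAML input for `kubectl apply -f -`, though in this platform the manifests would normally be committed to Git and reconciled by FluxCD.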
Security and Compliance
All platform foundation security controls apply unchanged:
- Kyverno policies enforce image sources, resource limits, and security contexts
- Pod Security Admission prevents privilege escalation
- RBAC Manager controls namespace access per team
- SOPS encrypts model credentials and API keys
- Harbor scans GPU workload images for vulnerabilities
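
To make the image-source control concrete, here is a hedged sketch of what a Kyverno policy restricting images to a single registry might look like, again built as a Python dict. The policy name, rule name, and registry host are hypothetical, and the policies the platform foundation ships may be structured differently. The spec-level `validationFailureAction` field is used here; newer Kyverno releases prefer a per-rule failure action, so check the field against the installed version.

```python
import json


def restrict_registry_policy(registry):
    """Build a Kyverno ClusterPolicy that admits only images from one registry."""
    return {
        "apiVersion": "kyverno.io/v1",
        "kind": "ClusterPolicy",
        # Hypothetical policy name.
        "metadata": {"name": "restrict-image-registries"},
        "spec": {
            # Reject non-compliant Pods instead of only auditing them.
            "validationFailureAction": "Enforce",
            "rules": [
                {
                    "name": "validate-registries",
                    "match": {"any": [{"resources": {"kinds": ["Pod"]}}]},
                    "validate": {
                        "message": f"Images must come from {registry}.",
                        # Every container image must match the wildcard pattern.
                        # Init containers are not covered by this sketch.
                        "pattern": {
                            "spec": {"containers": [{"image": f"{registry}/*"}]}
                        },
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical Harbor hostname.
    print(json.dumps(restrict_registry_policy("harbor.example.com"), indent=2))
```

Because GPU workloads are ordinary Pods, a policy of this shape would apply to them without any GPU-specific rule.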
Operational Consistency
- Same GitOps workflow (FluxCD) for GPU workloads as any other service
- Prometheus metrics for GPU utilization, memory, and temperature
- Loki for training job logs
- Alertmanager for GPU health alerts
- Velero backup includes GPU workload state
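
As an illustration of the GPU health alerts mentioned above, the following sketch builds a temperature alert as a Python dict. It assumes that alert rules are delivered as Prometheus Operator `PrometheusRule` resources and that GPU metrics come from NVIDIA's DCGM exporter (metric `DCGM_FI_DEV_GPU_TEMP`). Neither is confirmed by this document, and the rule name, namespace, threshold, and duration are placeholders.

```python
import json


def gpu_temperature_alert(threshold_celsius=85, duration="10m"):
    """Build a PrometheusRule that fires when a GPU stays above a temperature."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        # Hypothetical name and namespace.
        "metadata": {"name": "gpu-health", "namespace": "monitoring"},
        "spec": {
            "groups": [
                {
                    "name": "gpu.rules",
                    "rules": [
                        {
                            "alert": "GPUHighTemperature",
                            # Assumed metric name from the DCGM exporter.
                            "expr": f"DCGM_FI_DEV_GPU_TEMP > {threshold_celsius}",
                            # Only fire after the condition holds this long.
                            "for": duration,
                            "labels": {"severity": "warning"},
                            "annotations": {
                                "summary": "GPU on {{ $labels.instance }} is running hot"
                            },
                        }
                    ],
                }
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_temperature_alert(), indent=2))
```

Alertmanager would then route the resulting alert by its `severity` label, the same way it routes alerts for any other service.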
Current Status
The AI blueprint is in Preview. Available today:
- GPU Operator deployment via gitops-base
- Node labeling and taint configuration via Kubespray inventory
- Scheduling constraints and resource quotas
- Full observability and security from the platform foundation
Planned additions (no committed timeline):
- Kyverno policies specific to AI workloads (GPU resource validation)
- Grafana dashboards for GPU cluster utilization
- Air-gap packaging for GPU Operator and driver images
Composition
Further Reading
- Platform Foundation — services inherited by this blueprint
- Blueprint Catalog — all blueprints with status