AI Blueprint

Purpose: Explains to platform engineers how the AI blueprint extends the platform foundation with GPU infrastructure patterns, policy controls, and consistent operations for AI/ML workloads.

Overview

AI workloads do not need a separate platform. They need GPUs, policy controls, and an ops model that does not fall apart when you scale past the demo. This blueprint provides infrastructure patterns designed for AI workloads, not generic compute with a GPU driver bolted on.

Status: Preview

What You Get

GPU-Ready From the Start: NVIDIA GPU Operator, node labeling, taints, and scheduling constraints are configured for GPU workloads.
Policy Before Production: Security and compliance controls ship with the cluster rather than being added after the first audit finding.
Same Ops, Every Environment: Dev, staging, and production use the same workflows and tooling, with the same level of confidence.

Capabilities

GPU Infrastructure

Component	Purpose
NVIDIA GPU Operator	Automated GPU driver and runtime management
Node labeling	GPU nodes labeled for targeted scheduling
Taints/tolerations	GPU nodes reserved for GPU workloads
Resource quotas	Prevent GPU hoarding across teams/namespaces
Scheduling constraints	Topology-aware placement for multi-GPU workloads
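As a rough illustration of how labels, taints, and GPU resource limits fit together, the sketch below builds a Pod manifest that selects GPU nodes, tolerates a GPU taint, and requests GPUs through the `nvidia.com/gpu` extended resource. The label `nvidia.com/gpu.present`, the taint key `nvidia.com/gpu`, the helper name `gpu_pod_manifest`, and the image URL are assumptions for illustration only; use whatever keys your inventory actually applies.

```python
import json

# Assumed values: replace with the label and taint keys your cluster uses.
GPU_NODE_LABEL = {"nvidia.com/gpu.present": "true"}  # assumed node label
GPU_TAINT_KEY = "nvidia.com/gpu"                     # assumed taint key


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that targets labeled, tainted GPU nodes."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Node labeling: schedule only onto nodes carrying the GPU label.
            "nodeSelector": dict(GPU_NODE_LABEL),
            # Taints/tolerations: allow this Pod onto nodes reserved for GPU work.
            "tolerations": [
                {"key": GPU_TAINT_KEY, "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # GPUs are requested as an extended resource under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for demonstration only.
    manifest = gpu_pod_manifest("train-job", "registry.example.com/ml/train:1.0", gpus=2)
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be committed to a GitOps repository or piped to `kubectl apply -f -`. Workloads without the toleration stay off the tainted GPU nodes, which is what keeps those nodes reserved.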

Security and Compliance

All platform foundation security applies unchanged:

Kyverno policies enforce image sources, resource limits, and security contexts
Pod Security Admission prevents privilege escalation
RBAC Manager controls namespace access per team
SOPS encryption for model credentials and API keys
Harbor scans GPU workload images for vulnerabilities
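To make the image-source control more concrete, here is a minimal sketch of a Kyverno `ClusterPolicy` that only admits Pods whose container images come from one registry. It follows the shape of Kyverno's published registry-restriction sample and is not necessarily the policy this platform ships. The registry hostname, policy name, and helper name are hypothetical, and newer Kyverno releases may prefer a rule-level failure action over `spec.validationFailureAction`, so check the fields against the version you run.

```python
import json


def restrict_registry_policy(allowed_registry: str) -> dict:
    """Build a Kyverno ClusterPolicy limiting Pod images to one registry."""
    return {
        "apiVersion": "kyverno.io/v1",
        "kind": "ClusterPolicy",
        "metadata": {"name": "restrict-image-registries"},  # hypothetical name
        "spec": {
            # "Enforce" rejects non-compliant Pods; "Audit" only reports them.
            "validationFailureAction": "Enforce",
            "background": True,
            "rules": [
                {
                    "name": "validate-registries",
                    "match": {"any": [{"resources": {"kinds": ["Pod"]}}]},
                    "validate": {
                        "message": f"Images must come from {allowed_registry}.",
                        # Every container image must match the wildcard pattern.
                        "pattern": {
                            "spec": {
                                "containers": [{"image": f"{allowed_registry}/*"}]
                            }
                        },
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # harbor.example.com is a placeholder for your Harbor hostname.
    print(json.dumps(restrict_registry_policy("harbor.example.com"), indent=2))
```

Because GPU workloads pass through the same admission path as any other Pod, a policy like this would apply to training and inference images without GPU-specific changes.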

Operational Consistency

Same GitOps workflow (FluxCD) for GPU workloads as any other service
Prometheus metrics for GPU utilization, memory, and temperature
Loki for training job logs
Alertmanager for GPU health alerts
Velero backup includes GPU workload state
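A GPU health alert could look roughly like the Prometheus rule group generated below. The sketch assumes GPU metrics are exported by NVIDIA's dcgm-exporter, which publishes `DCGM_FI_DEV_GPU_TEMP`. The alert name, the 85 °C threshold, the severity label, and the helper name are illustrative assumptions, not values taken from this blueprint.

```python
import json


def gpu_temperature_alert(threshold_celsius: int = 85, for_duration: str = "5m") -> dict:
    """Build a Prometheus rule group that fires when a GPU runs hot."""
    return {
        "groups": [
            {
                "name": "gpu-health",
                "rules": [
                    {
                        "alert": "GpuTemperatureHigh",  # hypothetical alert name
                        # Assumes the dcgm-exporter temperature metric is scraped.
                        "expr": f"DCGM_FI_DEV_GPU_TEMP > {threshold_celsius}",
                        # The condition must hold this long before the alert fires.
                        "for": for_duration,
                        "labels": {"severity": "warning"},
                        "annotations": {
                            "summary": "GPU temperature above threshold",
                            # Prometheus expands this template at alert time.
                            "description": "GPU {{ $labels.gpu }} reports a high temperature.",
                        },
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(gpu_temperature_alert(), indent=2))
```

Alertmanager then routes the firing alert using the `severity` label, the same way it handles alerts from any non-GPU service.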

Current Status

The AI blueprint is in Preview. Available today:

GPU Operator deployment via gitops-base
Node labeling and taint configuration via Kubespray inventory
Scheduling constraints and resource quotas
Full observability and security from platform foundation
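The resource quotas listed above can be sketched as a per-namespace `ResourceQuota` on the `nvidia.com/gpu` extended resource. Kubernetes accepts quota entries for extended resources only with the `requests.` prefix, which is why the key below is `requests.nvidia.com/gpu`. The quota name, namespace, and both helper functions are hypothetical, and `fits_quota` is only a local approximation of the check the API server performs at admission.

```python
import json

GPU_QUOTA_KEY = "requests.nvidia.com/gpu"


def gpu_quota_manifest(namespace: str, max_gpus: int) -> dict:
    """Build a ResourceQuota capping total GPU requests in a namespace."""
    if max_gpus < 0:
        raise ValueError("max_gpus must not be negative")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        "spec": {"hard": {GPU_QUOTA_KEY: str(max_gpus)}},
    }


def fits_quota(quota: dict, gpus_in_use: int, gpus_requested: int) -> bool:
    """Approximate the admission check: would this request stay within the cap?"""
    hard_limit = int(quota["spec"]["hard"][GPU_QUOTA_KEY])
    return gpus_in_use + gpus_requested <= hard_limit


if __name__ == "__main__":
    # "team-vision" is a placeholder namespace.
    quota = gpu_quota_manifest("team-vision", max_gpus=4)
    print(json.dumps(quota, indent=2))
    print(fits_quota(quota, gpus_in_use=3, gpus_requested=1))
    print(fits_quota(quota, gpus_in_use=3, gpus_requested=2))
```

With a cap of four GPUs, a namespace already using three can start one more single-GPU Pod, while a two-GPU request would be rejected until capacity is released.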

Planned additions (no committed timeline):

Kyverno policies specific to AI workloads (GPU resource validation)
Grafana dashboards for GPU cluster utilization
Air-gap packaging for GPU Operator and driver images
