System Overview
Comprehensive overview of KFP Operator architecture and system design for platform engineers
This section provides a comprehensive deep dive into the KFP Operator architecture, designed for platform engineers who need to understand the system internals for installation, configuration, maintenance, and troubleshooting.
The KFP Operator follows the standard Kubernetes operator pattern, extending the Kubernetes API with domain-specific knowledge for ML pipeline management. The architecture is designed for scalability, reliability, and extensibility.
The heart of the KFP Operator, responsible for:
The provider system abstracts different ML orchestration platforms:
graph TB
subgraph "Provider Interface"
Interface[Provider Interface]
Registry[Provider Registry]
Factory[Provider Factory]
end
subgraph "Provider Implementations"
KFPProvider[KFP Provider]
VertexProvider[Vertex AI Provider]
CustomProvider[Custom Provider]
end
subgraph "External Platforms"
KFP[Kubeflow Pipelines]
Vertex[Vertex AI]
Custom[Custom Platform]
end
Interface --> KFPProvider
Interface --> VertexProvider
Interface --> CustomProvider
KFPProvider --> KFP
VertexProvider --> Vertex
CustomProvider --> Custom
Registry --> Factory
Factory --> InterfaceThe operator uses Argo Workflows for reliable execution:
graph TB
subgraph "KFP Operator"
RunController[Run Controller]
WorkflowTemplate[Workflow Template]
end
subgraph "Argo Workflows"
ArgoController[Argo Controller]
Workflow[Workflow Instance]
WorkflowPods[Workflow Pods]
end
subgraph "Provider Service"
ProviderPod[Provider Pod]
ProviderAPI[Provider API]
end
RunController --> WorkflowTemplate
WorkflowTemplate --> ArgoController
ArgoController --> Workflow
Workflow --> WorkflowPods
WorkflowPods --> ProviderPod
ProviderPod --> ProviderAPIsequenceDiagram
participant User
participant K8sAPI as Kubernetes API
participant Controller as KFP Controller
participant Provider as Provider Service
participant Platform as ML Platform
User->>K8sAPI: Create Pipeline Resource
K8sAPI->>Controller: Pipeline Created Event
Controller->>Controller: Validate Pipeline Spec
Controller->>Provider: Compile Pipeline
Provider->>Platform: Register Pipeline
Platform-->>Provider: Pipeline ID
Provider-->>Controller: Compilation Result
Controller->>K8sAPI: Update Pipeline StatussequenceDiagram
participant User
participant K8sAPI as Kubernetes API
participant Controller as KFP Controller
participant Argo as Argo Workflows
participant Provider as Provider Service
participant Platform as ML Platform
User->>K8sAPI: Create Run Resource
K8sAPI->>Controller: Run Created Event
Controller->>Argo: Create Workflow
Argo->>Provider: Execute Provider Logic
Provider->>Platform: Start Pipeline Run
Platform-->>Provider: Run Status Updates
Provider-->>Argo: Report Progress
Argo-->>Controller: Workflow Status
Controller->>K8sAPI: Update Run Statusgraph TB
subgraph "KFP Operator Namespace"
ControllerSA[Controller Service Account]
ArgoSA[Argo Service Account]
ProviderSA[Provider Service Account]
end
subgraph "RBAC"
ControllerRole[Controller ClusterRole]
ArgoRole[Argo Role]
ProviderRole[Provider Role]
end
subgraph "External Systems"
CloudProvider[Cloud Provider IAM]
Registry[Container Registry]
Storage[Artifact Storage]
end
ControllerSA --> ControllerRole
ArgoSA --> ArgoRole
ProviderSA --> ProviderRole
ProviderSA --> CloudProvider
ProviderSA --> Registry
ProviderSA --> Storage# Restrict operator network access
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: kfp-operator-network-policy
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: kfp-operator
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: kfp-operator-system
egress:
- to:
- namespaceSelector:
matchLabels:
name: kubeflow
ports:
- protocol: TCP
port: 8888
graph TB
subgraph "KFP Operator"
Controller[Controller Manager]
Metrics[Metrics Endpoint]
end
subgraph "Monitoring Stack"
Prometheus[Prometheus Server]
ServiceMonitor[ServiceMonitor]
Grafana[Grafana Dashboard]
end
subgraph "Alerting"
AlertManager[Alert Manager]
PagerDuty[PagerDuty]
Slack[Slack]
end
Controller --> Metrics
ServiceMonitor --> Metrics
Prometheus --> ServiceMonitor
Prometheus --> AlertManager
Grafana --> Prometheus
AlertManager --> PagerDuty
AlertManager --> Slack{
"timestamp": "2024-01-15T10:30:00Z",
"level": "INFO",
"logger": "pipeline-controller",
"message": "Pipeline reconciliation completed",
"pipeline": "training-pipeline",
"namespace": "ml-team",
"reconciliation_id": "abc-123-def",
"duration_ms": 150
}
# Check controller status
kubectl get pods -n kfp-operator-system
kubectl logs -n kfp-operator-system deployment/kfp-operator-controller-manager
# Check provider connectivity
kubectl get providers
kubectl describe provider <provider-name>
# Test provider endpoints (requires port-forwarding)
kubectl exec -it deployment/kfp-operator-controller-manager -- \
curl -f http://provider-endpoint/health
# Check resource consumption
kubectl top pods -n kfp-operator-system
# Monitor controller metrics
kubectl port-forward -n kfp-operator-system svc/kfp-operator-metrics 8080:8080
curl http://localhost:8080/metrics | grep controller
Need more details? Explore the System Overview for detailed component documentation, or check the Configuration Guide for customization options.
Comprehensive overview of KFP Operator architecture and system design for platform engineers