Troubleshooting
Troubleshooting ML Pipelines
Diagnose and resolve common issues when developing and running ML pipelines with the KFP Operator.
Quick Problem Resolution
Most Common Issues
1. Pipeline Won’t Deploy
Symptoms: Pipeline resource shows error status
kubectl get pipeline my-pipeline
# NAME STATUS PROVIDER AGE
# my-pipeline Error - 2m
Quick Fixes:
- Check image exists and is accessible
- Verify provider is ready and configured
- Check resource quotas and limits
Detailed Guide: Pipeline Deployment Issues
2. Run Fails to Start
Symptoms: Run resource stuck in pending state
kubectl get run my-run
# NAME STATUS PIPELINE AGE
# my-run Pending my-pipeline 5m
Quick Fixes:
- Verify pipeline is ready
- Check runtime parameters are valid
- Ensure sufficient cluster resources
Detailed Guide: Run Execution Issues
3. Pipeline Components Fail
Symptoms: Individual pipeline steps fail during execution
kubectl logs -l workflows.argoproj.io/workflow=my-run
# Error: Component 'trainer' failed with exit code 1
Quick Fixes:
- Check component logs for specific errors
- Verify data inputs and outputs
- Check resource allocation
Detailed Guide: Component Debugging
Diagnostic Tools
Essential Commands
Check Pipeline Status
# Get pipeline overview
kubectl get pipelines
# Detailed pipeline information
kubectl describe pipeline <pipeline-name>
# Check pipeline events
kubectl get events --field-selector involvedObject.name=<pipeline-name>
View Kubernetes Events
The operator emits Kubernetes events for all resource transitions which can be viewed using kubectl describe.
# View events for a specific pipeline
kubectl describe pipeline pipeline-sample
# Example output:
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Normal Syncing 5m54s kfp-operator Updating [version: "v5-841641"]
# Warning SyncFailed 101s kfp-operator Failed [version: "v5-841641"]: pipeline update failed
# Normal Syncing 9m47s kfp-operator Updating [version: "57be7f4-681dd8"]
# Normal Synced 78s kfp-operator Succeeded [version: "57be7f4-681dd8"]
Monitor Run Execution
# Watch run progress
kubectl get run <run-name> -w
# Get run details
kubectl describe run <run-name>
# Check associated workflow
kubectl get workflows -l pipelines.kubeflow.org/run=<run-name>
View Logs
# Get workflow logs
kubectl logs -l workflows.argoproj.io/workflow=<run-name>
# Get specific component logs
kubectl logs <pod-name> -c <container-name>
# Follow logs in real-time
kubectl logs -f -l workflows.argoproj.io/workflow=<run-name>
Debugging Workflow
Check Resource Status
kubectl get pipeline,run,providerExamine Events
kubectl get events --sort-by='.lastTimestamp'Review Logs
kubectl logs -l app.kubernetes.io/name=kfp-operatorValidate Configuration
kubectl get <resource> -o yaml
Common Problems and Solutions
Pipeline Deployment Issues
Problem: Image Pull Errors
Error Message: Failed to pull image: unauthorized
Causes:
- Container registry authentication issues
- Image doesn’t exist or wrong tag
- Network connectivity problems
Solutions:
# Check if image exists
docker pull <your-image>
# Verify registry credentials
kubectl get secrets -o yaml | grep dockerconfigjson
# Test network connectivity
kubectl run test-pod --image=busybox --rm -it -- wget <registry-url>
Problem: Resource Validation Errors
Error Message: Pipeline.spec.image is required
Causes:
- Missing required fields in Pipeline spec
- Invalid YAML syntax
- Incorrect API version
Solutions:
# Validate YAML syntax
kubectl apply --dry-run=client -f pipeline.yaml
# Check API version
kubectl api-resources | grep pipelines
# Use kubectl explain for field reference
kubectl explain pipeline.spec
Run Execution Issues
Problem: Parameter Validation Errors
Error Message: Invalid runtime parameter: learning_rate
Causes:
- Parameter not defined in pipeline
- Wrong parameter type or format
- Typo in parameter name
Solutions:
# Define parameters in pipeline code
learning_rate = RuntimeParameter(
name='learning_rate',
default=0.001,
ptype=float
)
# Use correct parameter in Run
spec:
parameters:
- name: learning_rate
value: "0.001" # String value, will be converted
Enable Debug Logging
Pipeline-Level Debugging
# Add debug logging to pipeline components
import logging
logging.basicConfig(level=logging.DEBUG)
def debug_component_execution():
logger = logging.getLogger(__name__)
logger.debug("Component starting execution")
logger.debug(f"Input parameters: {locals()}")
Operator-Level Debugging
# Check operator logs with increased verbosity
kubectl logs -n kfp-operator-system deployment/kfp-operator-controller-manager
# Enable debug mode (platform team may need to do this)
# Set logging.verbosity: 2 in operator configuration
Pipeline Compilation Errors
Compiling locally
The KFP-Operator’s compiler can be used locally for debugging purposes. This can be especially useful for troubleshooting environment variable and beam argument resolution.
Environment setup and compiler injection
The compiler is injected into a shared directory first before it can be called from within the pipeline image. Note that the setup is usually only needed once unless you want to use a different version of the compiler.
export KFP_COMPILER_IMAGE=ghcr.io/kfp-operator/kfp-operator-tfx-compiler:<KFP-Operator version>
docker pull $KFP_COMPILER_IMAGE
# Create a temporary directory for the following steps, alternatively choose a different location
SHARED_DIR=$(mktemp -d)
# Inject the compiler into the shared temporary directory
docker run -v $SHARED_DIR:/shared $KFP_COMPILER_IMAGE /shared
Compiler configuration
The compilation process relies on the pipeline resource and the provider configuration being passed:
export PIPELINE_IMAGE=<your pipeline image>
# create the pipeline resource
cat > $SHARED_DIR/pipeline.yaml << EOF
name: <Your pipeline name>
image: $PIPELINE_IMAGE
framework:
name: tfx
parameters:
components: <component function>
beamArgs:
- [] # List of NamedValues for beam arguments
env:
<Dict[str, str] of environment variables to be passed to the compilation step>
EOF
Running the compiler
You can then run the compiler from inside your pipeline container to produce $SHARED_DIR/pipeline_out.yaml.
# Run the compiler in your pipeline image
docker run -v $SHARED_DIR:/shared --entrypoint /shared/compile.sh $PIPELINE_IMAGE --pipeline_config /shared/pipeline.yaml --output_file /shared/pipeline_out.yaml
Getting Additional Help
When to Escalate
Contact your platform team if you encounter:
- Operator installation issues
- Provider connectivity problems
- Cluster resource constraints
- RBAC or permission errors
Self-Service Debugging
You can resolve these issues yourself:
- Pipeline code errors
- Runtime parameter issues
- Data processing problems
- Component configuration errors
Community Resources
- GitHub Issues: Search for known issues
- GitHub Discussions: Ask questions
- Platform Engineer Docs: For platform-level issues
Related Documentation
- Best Practices: Prevent common issues
- API Reference: Understand resource specifications
- Tutorials: Learn through examples
Still having issues? Check the detailed debugging guide for comprehensive troubleshooting procedures and advanced diagnostic techniques.