Troubleshooting Guide

Diagnose and resolve common issues with IBM GCM 2.0.1

System Health Check

Quick commands to verify overall system health.

Check All Pods

bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get pods -A

Check GCM Pods

bash
kubectl get pods -n gcm-system

Check Services

bash
kubectl get svc -n gcm-system

Check Storage

bash
kubectl get pvc -A
kubectl get storageclass

Check Nodes

bash
kubectl get nodes -o wide

Check Events

bash
kubectl get events -n gcm-system --sort-by='.lastTimestamp'

Expected Healthy State

  • All GCM pods in Running state
  • All PVCs in Bound state
  • Node status is Ready
  • Services have endpoints assigned
  • No recent error events
  • Pod restart count is 0 or low
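
The pod-related items in this checklist can be partly automated. The following is a minimal sketch, not part of the GCM tooling: unhealthy_pods is a hypothetical helper that reads the output of kubectl get pods --no-headers and reports pods that are not Running (Completed pods are ignored) or whose restart count exceeds a threshold. The default threshold of 3 is an assumption, because this guide does not define what counts as a low restart count, and the pod names in the sample are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with GCM.
# Reads `kubectl get pods -n <namespace> --no-headers` output on stdin.
# Columns: NAME READY STATUS RESTARTS AGE
# $1 (optional): highest restart count still treated as "low" (assumed default: 3)
unhealthy_pods() {
    awk -v max="${1:-3}" '
        $3 == "Completed" { next }    # finished jobs are expected
        $3 != "Running"   { printf "%s: status is %s\n", $1, $3; next }
        $4 + 0 > max      { printf "%s: %d restarts\n", $1, $4 + 0 }
    '
}

# Demo on made-up sample output. On a live system use:
#   kubectl get pods -n gcm-system --no-headers | unhealthy_pods
unhealthy_pods <<'EOF'
gcm-api-7d9f       1/1   Running            0             2d
gcm-worker-x2k     0/1   CrashLoopBackOff   12 (3m ago)   2d
gcm-db-0           1/1   Running            7             2d
gcm-init-job-abc   0/1   Completed          0             2d
EOF
```

For the sample input, the helper reports gcm-worker-x2k (status) and gcm-db-0 (restarts). An empty result means every pod passed both checks.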

    Diagnostic Commands

    Detailed commands for investigating specific components.

    Pod Diagnostics

    bash
    # Get detailed pod information
    kubectl describe pod <pod-name> -n gcm-system
    
    # Check pod logs
    kubectl logs <pod-name> -n gcm-system
    
    # Follow logs in real-time
    kubectl logs -f <pod-name> -n gcm-system
    
    # Get logs from previous container (if crashed)
    kubectl logs <pod-name> -n gcm-system --previous
    
    # Check pod resource usage
    kubectl top pod <pod-name> -n gcm-system

    Storage Diagnostics

    bash
    # Check PVC status
    kubectl get pvc -n gcm-system
    
    # Describe PVC for details
    kubectl describe pvc <pvc-name> -n gcm-system
    
    # Check storage classes
    kubectl get storageclass
    
    # Check Ceph cluster health (if using Rook Ceph)
    kubectl get cephcluster -n rook-ceph
    kubectl get pods -n rook-ceph
    
    # Check Ceph status
    kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph status

    Network Diagnostics

    bash
    # Check service endpoints
    kubectl get endpoints -n gcm-system
    
    # Test service connectivity
    kubectl run test-pod --rm -it --image=busybox -- sh
    # Inside pod: wget -O- http://service-name.gcm-system.svc.cluster.local
    
    # Check DNS resolution
    kubectl run test-dns --rm -it --image=busybox -- nslookup gcm-service.gcm-system.svc.cluster.local
    
    # Check firewall rules
    firewall-cmd --list-all
    
    # Test external connectivity
    curl -k https://localhost:31443
    curl -k https://localhost:30443

    System Resource Diagnostics

    bash
    # Check node resources
    kubectl top nodes
    
    # Check pod resources
    kubectl top pods -n gcm-system
    
    # Check disk usage
    df -h
    
    # Check memory usage
    free -h
    
    # Check CPU usage
    top -bn1 | head -20
    
    # Check system logs
    journalctl -u k3s -n 100 --no-pager

    Log Analysis

    How to find and analyze logs for troubleshooting.

    GCM Application Logs

    bash
    # Get all GCM pod logs
    kubectl logs -n gcm-system -l app=gcm --tail=100
    
    # Search for errors
    kubectl logs -n gcm-system -l app=gcm --tail=500 | grep -i error
    
    # Search for warnings
    kubectl logs -n gcm-system -l app=gcm --tail=500 | grep -i warning
    
    # Get logs from the last hour
    kubectl logs -n gcm-system <pod-name> --since=1h
    
    # Save logs to file
    kubectl logs -n gcm-system <pod-name> > gcm-pod.log
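
When a log is long, a count of matching lines gives a quicker first impression than reading the grep output. As a minimal sketch (log_summary is a hypothetical helper, not a GCM or kubectl feature), the function below reads log text on standard input and counts lines containing "error" or "warning", case-insensitively, using the same two patterns as the grep commands in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count error and warning lines in log text read from stdin.
log_summary() {
    awk '
        { line = tolower($0) }
        line ~ /error/   { errors++ }
        line ~ /warning/ { warnings++ }
        END { printf "lines=%d errors=%d warnings=%d\n", NR, errors, warnings }
    '
}

# Usage against the cluster (the label app=gcm comes from the commands in this section):
#   kubectl logs -n gcm-system -l app=gcm --tail=500 | log_summary
printf '%s\n' 'INFO started' 'ERROR db timeout' 'Warning: slow query' | log_summary
# prints: lines=3 errors=1 warnings=1
```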

    System Logs

    bash
    # K3s service logs
    journalctl -u k3s -n 200 --no-pager
    
    # Follow K3s logs
    journalctl -u k3s -f
    
    # System messages
    tail -f /var/log/messages
    
    # Check for OOM kills
    dmesg | grep -i 'out of memory'
    
    # Check for kernel errors (including disk errors)
    dmesg | grep -i 'error'

    Installation Logs

    bash
    # View installation logs
    ls -lh /root/*install*.log
    
    # Check K3s installation log
    cat /root/k3s_install.log
    
    # Check GCM installation log
    cat /root/gcm_install.log
    
    # Search for errors in installation
    grep -i error /root/*install*.log

    Issue: Network Configuration Errors

    Troubleshoot errors when running the VM configuration wizard (01-configure_vm.sh).

    Error: "method 'manual' requires at least an address or a route"

    Symptom: NetworkManager fails to apply the static IP configuration and reports a missing address or route.

    Root Cause

    This error occurs when:

    • Subnet mask is not properly converted to CIDR notation
    • NetworkManager connection name doesn't match the interface
    • IP address format is incorrect

    Solution

    The latest version of 01-configure_vm.sh (v1.0.0+) includes fixes for this issue:

    • ✅ Automatic subnet mask to CIDR conversion
    • ✅ Improved connection detection with multiple fallback methods
    • ✅ Better error messages showing available connections

    bash
    # Download latest version
    curl -O http://acefs01.ace.ibm.aessatl.arrow.com/downloads/gcm_2_0_1/01-configure_vm.sh
    chmod +x 01-configure_vm.sh
    
    # Re-run configuration
    ./01-configure_vm.sh

    Manual Configuration (if script fails)

    If the script continues to fail, configure manually:

    bash
    # List available connections
    nmcli connection show
    
    # Identify your connection (usually matches interface name like 'ens33')
    CONNECTION_NAME='ens33'  # or 'System ens33', etc.
    
    # Configure network (replace with your values)
    nmcli connection modify "$CONNECTION_NAME" \
        ipv4.method manual \
        ipv4.addresses "172.20.28.202/24" \
        ipv4.gateway "172.20.28.1" \
        ipv4.dns "172.20.12.100"
    
    # Bring up connection
    nmcli connection up "$CONNECTION_NAME"

    Subnet Mask to CIDR Conversion

    Common subnet mask conversions:

    Subnet Mask      CIDR   Usable IPs
    255.255.255.0    /24    254
    255.255.254.0    /23    510
    255.255.252.0    /22    1,022
    255.255.0.0      /16    65,534
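
These values follow from counting the leading 1-bits of the mask: the prefix length is that count, and the usable address count is 2^(32 - prefix) - 2, because the network and broadcast addresses are excluded. The sketch below illustrates the conversion. It is not the code used by 01-configure_vm.sh, and mask_to_cidr and usable_hosts are hypothetical names chosen for this example.

```shell
#!/usr/bin/env bash
# Illustrative only; not taken from 01-configure_vm.sh.

# Print the CIDR prefix length for a dotted-decimal subnet mask.
# Returns 1 for masks whose 1-bits are not contiguous (for example 255.0.255.0).
mask_to_cidr() {
    local IFS=. prefix=0 partial=0 bits octet
    local -a octets
    read -r -a octets <<< "$1"
    [ "${#octets[@]}" -eq 4 ] || return 1
    for octet in "${octets[@]}"; do
        case "$octet" in
            255) bits=8 ;; 254) bits=7 ;; 252) bits=6 ;; 248) bits=5 ;;
            240) bits=4 ;; 224) bits=3 ;; 192) bits=2 ;; 128) bits=1 ;;
            0)   bits=0 ;;
            *)   return 1 ;;
        esac
        # After the first octet below 255, every remaining octet must be 0.
        if [ "$partial" -eq 1 ] && [ "$bits" -ne 0 ]; then return 1; fi
        if [ "$bits" -ne 8 ]; then partial=1; fi
        prefix=$((prefix + bits))
    done
    echo "$prefix"
}

# Print the number of usable host addresses for a prefix length up to /30.
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}

prefix=$(mask_to_cidr 255.255.254.0)
echo "255.255.254.0 -> /$prefix, $(usable_hosts "$prefix") usable IPs"
# prints: 255.255.254.0 -> /23, 510 usable IPs
```

The resulting prefix is what nmcli expects after the address, as in ipv4.addresses "172.20.28.202/24".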

    Connection Detection Issues

    Cannot find NetworkManager connection

    Diagnosis:

    bash
    # List all connections
    nmcli connection show
    
    # Check active connections
    nmcli connection show --active
    
    # Check network interfaces
    ip link show

    Solution: Use the exact connection name from nmcli connection show output.

    Tip: The updated script automatically tries multiple detection methods and falls back to using the interface name directly if needed.

    Issue: Pods Not Starting

    Troubleshoot pods stuck in Pending, CrashLoopBackOff, or Error states.

    Symptoms

    • Pods stuck in Pending state
    • Pods in CrashLoopBackOff
    • Pods showing Error or ImagePullBackOff
    • High restart count on pods

    Diagnosis

    bash
    # Check pod status
    kubectl get pods -n gcm-system
    
    # Get detailed pod information
    kubectl describe pod <pod-name> -n gcm-system
    
    # Check pod events
    kubectl get events -n gcm-system --field-selector involvedObject.name=<pod-name>
    
    # Check logs from the previous (crashed) container
    kubectl logs <pod-name> -n gcm-system --previous

    Common Causes & Solutions

    Insufficient Resources

    Pod cannot be scheduled due to insufficient CPU or memory.

    bash
    # Check node resources
    kubectl top nodes
    kubectl describe node
    
    # Solution: Increase VM resources or reduce pod requests

    Storage Issues

    PVC not bound or storage class unavailable.

    bash
    # Check PVC status
    kubectl get pvc -n gcm-system
    
    # Check storage classes
    kubectl get storageclass
    
    # Solution: Ensure storage classes exist and are available

    Image Pull Errors

    Cannot pull container image from registry.

    bash
    # Check image pull secrets
    kubectl get secrets -n gcm-system
    
    # Test internet connectivity
    curl -I https://registry.k8s.io
    
    # Solution: Check network connectivity and registry credentials

    Issue: Storage Problems

    Troubleshoot PVC binding issues and storage class problems.

    Symptoms

    • PVCs stuck in Pending state
    • Storage class not found errors
    • Ceph cluster not healthy
    • Disk space issues

    Check Storage Classes

    bash
    # List storage classes
    kubectl get storageclass
    
    # Expected output should show:
    # - rook-cephfs (for filesystem storage)
    # - rook-ceph-block (for block storage)
    # OR
    # - local-path (for local storage)

    Critical: GCM requires BOTH the rook-cephfs and rook-ceph-block storage classes when using Ceph, or the local-path storage class for simple deployments.

    Check Ceph Health (if using Rook Ceph)

    bash
    # Check Ceph cluster
    kubectl get cephcluster -n rook-ceph
    
    # Check Ceph pods
    kubectl get pods -n rook-ceph
    
    # Check Ceph status
    kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph status
    
    # Check Ceph OSD status
    kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd status

    Check Disk Space

    bash
    # Check disk usage
    df -h
    
    # Check /var partition (K3s uses this)
    df -h /var
    
    # Check for large files
    du -sh /var/* | sort -h
    
    # Clean up if needed
    kubectl delete pods --field-selector status.phase=Succeeded -A
    kubectl delete pods --field-selector status.phase=Failed -A
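
To spot nearly full filesystems without reading the whole df listing, the output can be filtered by usage percentage. The following is a minimal sketch: full_filesystems is a hypothetical helper, and the default threshold of 85% is an assumed value, not a documented GCM limit. It expects the POSIX output format of df -P, where the fifth column is the usage percentage and the sixth is the mount point. The device names in the sample are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list mount points at or above a usage threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (assumed default: 85).
# Mount points that contain spaces are not handled.
full_filesystems() {
    awk -v limit="${1:-85}" '
        NR > 1 {
            used = $5
            sub(/%$/, "", used)    # "95%" -> "95"
            if (used + 0 >= limit + 0)
                printf "%s is %s%% full\n", $6, used
        }
    '
}

# Demo on made-up sample output. On a live system use:
#   df -P | full_filesystems 85
printf '%s\n' \
    'Filesystem     1024-blocks     Used Available Capacity Mounted on' \
    '/dev/sda2         51475068 46842312   2608932      95% /var' \
    '/dev/sda3        103081248 20616250  77205832      22% /home' |
    full_filesystems 85
# prints: /var is 95% full
```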

    Solution: Recreate Storage Classes

    If storage classes are missing, you may need to reinstall K3s with proper storage configuration.

    bash
    # For local-path storage
    ./install_singlenode_k3s_localpath.sh
    
    # For Ceph storage (requires second disk)
    ./install_singlenode_k3s_ceph.sh

    Issue: Network Connectivity Problems

    Troubleshoot network connectivity and DNS resolution issues.

    Symptoms

    • Cannot access GCM web interface
    • Pods cannot communicate with each other
    • DNS resolution failures
    • Timeout errors

    Check Firewall

    bash
    # Check firewall status
    firewall-cmd --state
    
    # List open ports
    firewall-cmd --list-all
    
    # Open required ports if needed
    firewall-cmd --permanent --add-port=30443/tcp
    firewall-cmd --permanent --add-port=31443/tcp
    firewall-cmd --reload

    Test Connectivity

    bash
    # Test local access
    curl -k https://localhost:31443
    curl -k https://localhost:30443
    
    # Test from another machine
    curl -k https://<server-ip>:31443
    
    # Check if ports are listening
    ss -tlnp | grep -E '30443|31443'
    
    # Test DNS resolution
    nslookup <hostname>
    ping <hostname>

    Check Service Endpoints

    bash
    # Check services
    kubectl get svc -n gcm-system
    
    # Check endpoints
    kubectl get endpoints -n gcm-system
    
    # Describe service for details
    kubectl describe svc <service-name> -n gcm-system

    DNS Issues

    bash
    # Check DNS configuration
    cat /etc/resolv.conf
    
    # Test DNS resolution
    nslookup google.com
    nslookup <gcm-fqdn>
    
    # Check CoreDNS pods
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    
    # Restart CoreDNS if needed
    kubectl rollout restart deployment/coredns -n kube-system

    Issue: OIDC Authentication Errors

    Troubleshoot OIDC provider and authentication issues.

    Symptoms

    • Cannot log in to GCM
    • OIDC redirect errors
    • "Invalid redirect URI" errors
    • Certificate errors

    Check OIDC Service

    bash
    # Check OIDC pods
    kubectl get pods -n gcm-system | grep oidc
    
    # Check OIDC service
    kubectl get svc -n gcm-system | grep oidc
    
    # Test OIDC endpoint
    curl -k https://localhost:30443/.well-known/openid-configuration

    Fix OIDC FQDN Issues

    If you're getting redirect URI errors, the OIDC provider may be using the wrong hostname.

    bash
    # Use the fix script
    ./fix_oidc_fqdn.sh
    
    # Or manually update OIDC configuration
    kubectl edit configmap oidc-config -n gcm-system
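
Before changing anything, it can help to confirm which hostname the provider actually advertises. OIDC discovery documents carry an issuer field, so comparing its host with the name users type into the browser shows whether the two disagree. The sketch below is an illustration under assumptions: issuer_host is a hypothetical helper, it uses a sed pattern rather than a JSON parser, and gcm.example.com in the sample is a placeholder hostname.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the host part of the "issuer" URL from an
# OIDC discovery document read on stdin (simple pattern match, not a JSON parser).
issuer_host() {
    sed -n 's|.*"issuer"[[:space:]]*:[[:space:]]*"https\{0,1\}://\([^"/:]*\).*|\1|p' | head -n 1
}

# On a live system (endpoint taken from the OIDC service check in this guide):
#   curl -ks https://localhost:30443/.well-known/openid-configuration | issuer_host
printf '%s\n' '{"issuer":"https://gcm.example.com:30443","jwks_uri":"https://gcm.example.com:30443/keys"}' | issuer_host
# prints: gcm.example.com
```

If the printed host differs from the FQDN used to reach GCM, that mismatch is a plausible source of the redirect URI errors.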

    Check Certificates

    bash
    # Check certificates
    kubectl get certificates -n gcm-system
    
    # Check certificate details
    kubectl describe certificate <certificate-name> -n gcm-system
    
    # Check cert-manager logs
    kubectl logs -n cert-manager -l app=cert-manager

    Issue: Performance Problems

    Troubleshoot slow response times and resource constraints.

    Symptoms

    • Slow web interface
    • High CPU or memory usage
    • Pods being OOM killed
    • Disk I/O bottlenecks

    Check Resource Usage

    bash
    # Check node resources
    kubectl top nodes
    
    # Check pod resources
    kubectl top pods -n gcm-system
    
    # Check for resource limits
    kubectl describe pod <pod-name> -n gcm-system | grep -A 5 Limits
    
    # Check for OOM kills
    dmesg | grep -i 'out of memory'

    Check Disk I/O

    bash
    # Check disk usage
    df -h
    
    # Check I/O statistics
    iostat -x 1 5
    
    # Check for slow disks
    dmesg | grep -i 'slow'
    
    # If using Ceph, check OSD performance
    kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd perf

    Solutions

    Increase VM Resources:
    • Add more vCPUs (minimum 24, recommended 32+)
    • Increase RAM (minimum 48GB, recommended 64GB+)
    • Use faster storage (SSD recommended)

    Optimize Configuration:
    • Use appropriate sizing model (xsmall, small, medium, large)
    • Adjust pod resource requests and limits
    • Enable resource quotas
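
The minimum figures listed under Increase VM Resources can be checked with a few lines of shell. As a sketch under stated assumptions (check_minimums is a hypothetical helper, not part of the GCM scripts), the function compares a vCPU count and a memory size in GB against the 24 vCPU and 48GB minimums.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare vCPU count ($1) and RAM in whole GB ($2)
# with the minimums listed in this guide (24 vCPUs, 48GB).
# Prints one line per shortfall and returns 1 if any minimum is missed.
check_minimums() {
    local cpus=$1 mem_gb=$2 rc=0
    if [ "$cpus" -lt 24 ]; then
        echo "vCPUs: $cpus (minimum 24)"
        rc=1
    fi
    if [ "$mem_gb" -lt 48 ]; then
        echo "RAM: ${mem_gb}GB (minimum 48GB)"
        rc=1
    fi
    return "$rc"
}

# Demo with example values. On the VM, the vCPU count can come from nproc.
# Take the RAM figure from the VM settings, because the kernel's MemTotal
# is slightly lower than the configured size.
check_minimums 16 64 || echo "Below the documented minimums"
```

The demo prints the vCPU shortfall followed by the summary line, because 16 vCPUs is under the minimum while 64GB of RAM is not.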

    Restart Services

    How to restart GCM services without losing data.

    Restart Individual Pods

    bash
    # Delete pod (will be recreated automatically)
    kubectl delete pod <pod-name> -n gcm-system
    
    # Restart deployment
    kubectl rollout restart deployment/<deployment-name> -n gcm-system
    
    # Restart all GCM deployments
    kubectl rollout restart deployment -n gcm-system

    Restart K3s Service

    bash
    # Restart K3s
    systemctl restart k3s
    
    # Check K3s status
    systemctl status k3s
    
    # Wait for pods to come back
    watch kubectl get pods -A

    Warning: Restarting K3s will temporarily disrupt all services. Use this only when necessary.

    Graceful Shutdown

    Properly shut down GCM to prevent data corruption.

    Using the Shutdown Script

    bash
    # Stop GCM (leave system running)
    ./shutdown_gcm_gracefully.sh
    
    # Stop GCM and shut down the system
    ./shutdown_gcm_gracefully.sh --shutdown-system

    Shutdown Order

    The script follows this order to ensure a clean shutdown:

    1. GCM Application Pods (2-3 minutes)
    2. Middleware Services - Redis, Kafka, MongoDB, PostgreSQL (3-5 minutes)
    3. Operators (1-2 minutes)
    4. Remaining Pods (1 minute)
    5. Cert-Manager (1 minute)
    6. Rook Ceph (optional, 2-3 minutes)
    7. K3s (30 seconds)
    8. System Shutdown (optional)

    Total Time: 8-15 minutes depending on options selected.

    Manual Shutdown

    If the script is not available:

    bash
    # Scale down GCM deployments
    kubectl scale deployment --all --replicas=0 -n gcm-system
    
    # Stop K3s
    systemctl stop k3s
    
    # Shut down the system
    shutdown -h now

    Reset Installation

    Start over with a clean installation.

    Warning: This will DELETE ALL DATA. Make backups before proceeding!

    Uninstall K3s

    bash
    # Uninstall K3s (removes everything)
    /usr/local/bin/k3s-uninstall.sh
    
    # Verify removal
    systemctl status k3s
    # Should show: Unit k3s.service could not be found

    Clean Up Remaining Files

    bash
    # Remove K3s data
    rm -rf /var/lib/rancher/k3s
    rm -rf /etc/rancher/k3s
    
    # Remove Rook Ceph data (if used)
    rm -rf /var/lib/rook
    
    # Clean up configuration
    rm -rf /etc/gcm

    Reinstall

    After cleanup, follow the Quick Start guide to reinstall: