Troubleshooting Guide
Diagnose and resolve common issues with IBM GCM 2.0.1
System Health Check
Quick commands to verify overall system health.
Check All Pods
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get pods -A
Check GCM Pods
kubectl get pods -n gcm-system
Check Services
kubectl get svc -n gcm-system
Check Storage
kubectl get pvc -A
kubectl get storageclass
Check Nodes
kubectl get nodes -o wide
Check Events
kubectl get events -n gcm-system --sort-by='.lastTimestamp'
Expected Healthy State
- Pods: Running state
- PVCs: Bound state
- Nodes: Ready
Diagnostic Commands
Detailed commands for investigating specific components.
Pod Diagnostics
# Get detailed pod information
kubectl describe pod <pod-name> -n gcm-system
# Check pod logs
kubectl logs <pod-name> -n gcm-system
# Follow logs in real-time
kubectl logs -f <pod-name> -n gcm-system
# Get logs from previous container (if crashed)
kubectl logs <pod-name> -n gcm-system --previous
# Check pod resource usage
kubectl top pod -n gcm-system
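To pick which pod to inspect with the commands above, it can help to filter the pod list down to the ones that are not healthy. The following is a minimal sketch, not part of the GCM tooling: `unhealthy_pods` is a hypothetical helper name and the sample pod names are made up. It only looks at the STATUS column, so a pod that is Running but not Ready still needs a manual look at the READY column.

```shell
# List pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods --no-headers` output for a single namespace on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and prints "NAME STATUS".
unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Against a live cluster (assumes KUBECONFIG is exported as shown earlier):
#   kubectl get pods -n gcm-system --no-headers | unhealthy_pods

# Demonstration on sample data (pod names are illustrative only):
printf '%s\n' \
    'gcm-api-0 1/1 Running 0 2d' \
    'gcm-db-0 0/1 CrashLoopBackOff 12 2d' |
    unhealthy_pods
```

Each name printed can then be passed to `kubectl describe pod` or `kubectl logs --previous`.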
Storage Diagnostics
# Check PVC status
kubectl get pvc -n gcm-system
# Describe PVC for details
kubectl describe pvc -n gcm-system
# Check storage classes
kubectl get storageclass
# Check Ceph cluster health (if using Rook Ceph)
kubectl get cephcluster -n rook-ceph
kubectl get pods -n rook-ceph
# Check Ceph status
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph status
Network Diagnostics
# Check service endpoints
kubectl get endpoints -n gcm-system
# Test service connectivity
kubectl run test-pod --rm -it --image=busybox -- sh
# Inside pod: wget -O- http://service-name.gcm-system.svc.cluster.local
# Check DNS resolution
kubectl run test-dns --rm -it --image=busybox -- nslookup gcm-service.gcm-system.svc.cluster.local
# Check firewall rules
firewall-cmd --list-all
# Test external connectivity
curl -k https://localhost:31443
curl -k https://localhost:30443
System Resource Diagnostics
# Check node resources
kubectl top nodes
# Check pod resources
kubectl top pods -n gcm-system
# Check disk usage
df -h
# Check memory usage
free -h
# Check CPU usage
top -bn1 | head -20
# Check system logs
journalctl -u k3s -n 100 --no-pager
Log Analysis
How to find and analyze logs for troubleshooting.
GCM Application Logs
# Get all GCM pod logs
kubectl logs -n gcm-system -l app=gcm --tail=100
# Search for errors
kubectl logs -n gcm-system -l app=gcm --tail=500 | grep -i error
# Search for warnings
kubectl logs -n gcm-system -l app=gcm --tail=500 | grep -i warning
# Get logs from the last hour
kubectl logs <pod-name> -n gcm-system --since=1h
# Save logs to file
kubectl logs <pod-name> -n gcm-system > gcm-pod.log
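When a log is long, a quick count of error and warning lines shows whether it is worth reading in full. This is a small sketch under the assumption that the application writes the words "error" and "warn" in its log lines; `summarize_log` is a hypothetical helper, and the actual GCM log format may differ.

```shell
# Count lines containing "error" and "warn" (case-insensitive) in log text on stdin.
summarize_log() {
    awk '
        { line = tolower($0) }
        line ~ /error/ { errors++ }
        line ~ /warn/  { warnings++ }
        END { printf "errors=%d warnings=%d\n", errors, warnings }
    '
}

# Against a live cluster:
#   kubectl logs -n gcm-system -l app=gcm --tail=500 | summarize_log
# Against a saved file:
#   summarize_log < gcm-pod.log
```

A line that contains both words is counted once in each total.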
System Logs
# K3s service logs
journalctl -u k3s -n 200 --no-pager
# Follow K3s logs
journalctl -u k3s -f
# System messages
tail -f /var/log/messages
# Check for OOM kills
dmesg | grep -i 'out of memory'
# Check for kernel errors (including disk errors)
dmesg | grep -i 'error'
Installation Logs
# View installation logs
ls -lh /root/*install*.log
# Check K3s installation log
cat /root/k3s_install.log
# Check GCM installation log
cat /root/gcm_install.log
# Search for errors in installation
grep -i error /root/*install*.log
Issue: Network Configuration Errors
Troubleshoot errors when running the VM configuration wizard (01-configure_vm.sh).
Error: "method 'manual' requires at least an address or a route"
Root Cause
This error occurs when:
- Subnet mask is not properly converted to CIDR notation
- NetworkManager connection name doesn't match the interface
- IP address format is incorrect
Solution
The latest version of 01-configure_vm.sh (v1.0.0+) includes fixes for this issue:
- ✅ Automatic subnet mask to CIDR conversion
- ✅ Improved connection detection with multiple fallback methods
- ✅ Better error messages showing available connections
# Download latest version
curl -O http://acefs01.ace.ibm.aessatl.arrow.com/downloads/gcm_2_0_1/01-configure_vm.sh
chmod +x 01-configure_vm.sh
# Re-run configuration
./01-configure_vm.sh
Manual Configuration (if script fails)
If the script continues to fail, configure manually:
# List available connections
nmcli connection show
# Identify your connection (usually matches interface name like 'ens33')
CONNECTION_NAME='ens33' # or 'System ens33', etc.
# Configure network (replace with your values)
nmcli connection modify "$CONNECTION_NAME" \
ipv4.method manual \
ipv4.addresses "172.20.28.202/24" \
ipv4.gateway "172.20.28.1" \
ipv4.dns "172.20.12.100"
# Bring up connection
nmcli connection up "$CONNECTION_NAME"
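Because an incorrectly formatted address is one of the listed root causes, it may be worth validating the value before handing it to `nmcli`. The sketch below uses two hypothetical helpers, `is_valid_ipv4` and `is_valid_cidr`; they are not taken from 01-configure_vm.sh and only check syntax, not whether the address is right for your network.

```shell
# Return 0 if $1 is a dotted-decimal IPv4 address with four octets in 0-255.
is_valid_ipv4() {
    case $1 in
        ''|*[!0-9.]*) return 1 ;;   # empty, or characters other than digits and dots
        .*|*.|*..*)   return 1 ;;   # leading, trailing, or doubled dot
    esac
    old_ifs=$IFS
    IFS=.
    set -- $1                       # split the address into octets
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    for octet in "$@"; do
        [ ${#octet} -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Return 0 if $1 has the form ADDRESS/PREFIX with a valid address and a prefix in 0-32.
is_valid_cidr() {
    case $1 in
        */*) ;;
        *)   return 1 ;;
    esac
    addr=${1%/*}
    prefix=${1##*/}
    case $prefix in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ ${#prefix} -le 2 ] && [ "$prefix" -le 32 ] && is_valid_ipv4 "$addr"
}

# Example: check the value before running nmcli connection modify
if is_valid_cidr "172.20.28.202/24"; then
    echo "address looks well-formed"
fi
```

A value without the `/prefix` part fails the check, which mirrors the situation in which NetworkManager reports that the manual method needs an address.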
Subnet Mask to CIDR Conversion
Common subnet mask conversions:
| Subnet Mask | CIDR | Usable IPs |
|---|---|---|
| 255.255.255.0 | /24 | 254 |
| 255.255.254.0 | /23 | 510 |
| 255.255.252.0 | /22 | 1,022 |
| 255.255.0.0 | /16 | 65,534 |
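For masks not listed in the table, the conversion can be computed by counting the leading one-bits in each octet. The following is an illustrative sketch of that arithmetic, not the code used by 01-configure_vm.sh; `mask_to_cidr` and `usable_hosts` are hypothetical names.

```shell
# Convert a dotted-decimal subnet mask to a CIDR prefix length.
# Prints the prefix length; returns 1 for a malformed or non-contiguous mask.
mask_to_cidr() {
    old_ifs=$IFS
    IFS=.
    set -- $1                       # split the mask into octets
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    bits=0
    seen_partial=0
    for octet in "$@"; do
        # once an octet below 255 has been seen, every later octet must be 0
        if [ "$seen_partial" -eq 1 ] && [ "$octet" != 0 ]; then
            return 1
        fi
        case $octet in
            255) bits=$((bits + 8)) ;;
            254) bits=$((bits + 7)); seen_partial=1 ;;
            252) bits=$((bits + 6)); seen_partial=1 ;;
            248) bits=$((bits + 5)); seen_partial=1 ;;
            240) bits=$((bits + 4)); seen_partial=1 ;;
            224) bits=$((bits + 3)); seen_partial=1 ;;
            192) bits=$((bits + 2)); seen_partial=1 ;;
            128) bits=$((bits + 1)); seen_partial=1 ;;
            0)   seen_partial=1 ;;
            *)   return 1 ;;
        esac
    done
    echo "$bits"
}

# Usable host addresses for a prefix length (meaningful for /30 and shorter):
# total addresses minus the network and broadcast addresses.
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}

mask_to_cidr 255.255.254.0   # prints 23
usable_hosts 23              # prints 510
```

The resulting prefix is what goes after the slash in `ipv4.addresses`, for example `172.20.28.202/23`.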
Connection Detection Issues
Cannot find NetworkManager connection
Diagnosis:
# List all connections
nmcli connection show
# Check active connections
nmcli connection show --active
# Check network interfaces
ip link show
Solution: Use the exact connection name from nmcli connection show output.
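To avoid copying the name by hand, the connection bound to an interface can be looked up from `nmcli`'s terse output, which prints one `NAME:DEVICE` pair per line. This is a sketch only: `connection_for_device` is a hypothetical helper, and it assumes the connection name itself contains no colon (nmcli escapes those as `\:`).

```shell
# Print the NetworkManager connection name bound to device $1.
# Reads `nmcli -t -f NAME,DEVICE connection show` output on stdin.
connection_for_device() {
    awk -F: -v dev="$1" '$NF == dev { sub(/:[^:]*$/, ""); print; exit }'
}

# On the VM (interface name ens33 is an example):
#   CONNECTION_NAME=$(nmcli -t -f NAME,DEVICE connection show | connection_for_device ens33)

# Demonstration on sample output:
printf '%s\n' 'System ens33:ens33' 'lo:lo' | connection_for_device ens33
```

An empty result means no connection profile is currently bound to that interface.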
Issue: Pods Not Starting
Troubleshoot pods stuck in Pending, CrashLoopBackOff, or Error states.
Symptoms
- Pods stuck in Pending state
- Pods in CrashLoopBackOff
- Pods showing Error or ImagePullBackOff
- High restart count on pods
Diagnosis
# Check pod status
kubectl get pods -n gcm-system
# Get detailed pod information
kubectl describe pod <pod-name> -n gcm-system
# Check pod events
kubectl get events -n gcm-system --field-selector involvedObject.name=<pod-name>
# Check pod logs
kubectl logs <pod-name> -n gcm-system --previous
Common Causes & Solutions
Pod cannot be scheduled due to insufficient CPU or memory.
# Check node resources
kubectl top nodes
kubectl describe node
# Solution: Increase VM resources or reduce pod requests
PVC not bound or storage class unavailable.
# Check PVC status
kubectl get pvc -n gcm-system
# Check storage classes
kubectl get storageclass
# Solution: Ensure storage classes exist and are available
Cannot pull container image from registry.
# Check image pull secrets
kubectl get secrets -n gcm-system
# Test internet connectivity
curl -I https://registry.k8s.io
# Solution: Check network connectivity and registry credentials
Issue: Storage Problems
Troubleshoot PVC binding issues and storage class problems.
Symptoms
- PVCs stuck in Pending state
- Storage class not found errors
- Ceph cluster not healthy
- Disk space issues
Check Storage Classes
# List storage classes
kubectl get storageclass
# Expected output should show:
# - rook-cephfs (for filesystem storage)
# - rook-ceph-block (for block storage)
# OR
# - local-path (for local storage)
Expect the rook-cephfs and rook-ceph-block storage classes when using Ceph, or local-path for simple deployments.
Check Ceph Health (if using Rook Ceph)
# Check Ceph cluster
kubectl get cephcluster -n rook-ceph
# Check Ceph pods
kubectl get pods -n rook-ceph
# Check Ceph status
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph status
# Check Ceph OSD status
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd status
Check Disk Space
# Check disk usage
df -h
# Check /var partition (K3s uses this)
df -h /var
# Check for large files
du -sh /var/* | sort -h
# Clean up if needed
kubectl delete pods --field-selector status.phase=Succeeded -A
kubectl delete pods --field-selector status.phase=Failed -A
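To spot nearly full filesystems without reading the whole `df` listing, the output can be filtered by usage percentage. The sketch below uses a hypothetical helper, `filesystems_over`; the 85% default is an arbitrary example threshold, not a documented GCM limit, and mount points containing spaces are not handled.

```shell
# Print "MOUNTPOINT USE%" for filesystems at or above a usage threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (default 85).
filesystems_over() {
    awk -v limit="${1:-85}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}

# On the node:
#   df -P | filesystems_over 85
```

If /var shows up in the result, the cleanup commands above are a reasonable first step.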
Solution: Recreate Storage Classes
If storage classes are missing, you may need to reinstall K3s with proper storage configuration.
# For local-path storage
./install_singlenode_k3s_localpath.sh
# For Ceph storage (requires second disk)
./install_singlenode_k3s_ceph.sh
Issue: Network Connectivity Problems
Troubleshoot network connectivity and DNS resolution issues.
Symptoms
- Cannot access GCM web interface
- Pods cannot communicate with each other
- DNS resolution failures
- Timeout errors
Check Firewall
# Check firewall status
firewall-cmd --state
# List open ports
firewall-cmd --list-all
# Open required ports if needed
firewall-cmd --permanent --add-port=30443/tcp
firewall-cmd --permanent --add-port=31443/tcp
firewall-cmd --reload
Test Connectivity
# Test local access
curl -k https://localhost:31443
curl -k https://localhost:30443
# Test from another machine
curl -k https://<server-ip>:31443
# Check if ports are listening
ss -tlnp | grep -E '30443|31443'
# Test DNS resolution
nslookup <hostname>
ping <hostname>
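The port check can be made explicit by reporting which expected ports do not appear as listening sockets. This is a sketch with a hypothetical helper, `missing_ports`, that parses the Local Address:Port column of `ss -tln`. Treat a missing port as a hint rather than proof: depending on how the cluster forwards NodePort traffic, a working port may not show up as a listening socket, so the `curl` tests remain the more direct check.

```shell
# Print each port given as an argument that has no listening TCP socket.
# Reads `ss -tln` output on stdin.
missing_ports() {
    ss_output=$(cat)
    for port in "$@"; do
        printf '%s\n' "$ss_output" |
            awk -v p="$port" '
                NR > 1 { n = split($4, a, ":"); if (a[n] == p) found = 1 }
                END { exit !found }
            ' ||
            echo "$port"
    done
}

# On the node:
#   ss -tln | missing_ports 30443 31443
```

No output means every listed port was found.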
Check Service Endpoints
# Check services
kubectl get svc -n gcm-system
# Check endpoints
kubectl get endpoints -n gcm-system
# Describe service for details
kubectl describe svc -n gcm-system
DNS Issues
# Check DNS configuration
cat /etc/resolv.conf
# Test DNS resolution
nslookup google.com
nslookup <hostname>
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Restart CoreDNS if needed
kubectl rollout restart deployment/coredns -n kube-system
Issue: OIDC Authentication Errors
Troubleshoot OIDC provider and authentication issues.
Symptoms
- Cannot login to GCM
- OIDC redirect errors
- "Invalid redirect URI" errors
- Certificate errors
Check OIDC Service
# Check OIDC pods
kubectl get pods -n gcm-system | grep oidc
# Check OIDC service
kubectl get svc -n gcm-system | grep oidc
# Test OIDC endpoint
curl -k https://localhost:30443/.well-known/openid-configuration
Fix OIDC FQDN Issues
If you're getting redirect URI errors, the OIDC provider may be using the wrong hostname.
# Use the fix script
./fix_oidc_fqdn.sh
# Or manually update OIDC configuration
kubectl edit configmap oidc-config -n gcm-system
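One way to confirm a hostname mismatch before or after applying the fix is to read the `issuer` field from the discovery document and compare its host with the name users type in the browser. The sketch below assumes the provider serves a standard OpenID Connect discovery document at the endpoint tested earlier; `oidc_issuer` and `issuer_matches_host` are hypothetical helpers and `gcm.example.com` is a placeholder hostname.

```shell
# Extract the "issuer" value from an OpenID Connect discovery document on stdin.
oidc_issuer() {
    tr -d '\n' | sed -n 's/.*"issuer"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Return 0 if the host part of the issuer URL (stdin) equals the expected hostname ($1).
issuer_matches_host() {
    issuer=$(oidc_issuer)
    host=${issuer#*://}     # strip the scheme
    host=${host%%/*}        # strip any path
    host=${host%%:*}        # strip any port
    [ "$host" = "$1" ]
}

# Against the running service (replace gcm.example.com with your FQDN):
#   curl -sk https://localhost:30443/.well-known/openid-configuration |
#       issuer_matches_host gcm.example.com || echo "issuer hostname mismatch"
```

A mismatch suggests the provider was configured with a different hostname than the one clients use, which is the situation the fix script is meant to address.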
Check Certificates
# Check certificates
kubectl get certificates -n gcm-system
# Check certificate details
kubectl describe certificate -n gcm-system
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager
Issue: Performance Problems
Troubleshoot slow response times and resource constraints.
Symptoms
- Slow web interface
- High CPU or memory usage
- Pods being OOM killed
- Disk I/O bottlenecks
Check Resource Usage
# Check node resources
kubectl top nodes
# Check pod resources
kubectl top pods -n gcm-system
# Check for resource limits
kubectl describe pod -n gcm-system | grep -A 5 Limits
# Check for OOM kills
dmesg | grep -i 'out of memory'
Check Disk I/O
# Check disk usage
df -h
# Check I/O statistics
iostat -x 1 5
# Check for slow disks
dmesg | grep -i 'slow'
# If using Ceph, check OSD performance
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd perf
Solutions
- Add more vCPUs (minimum 24, recommended 32+)
- Increase RAM (minimum 48GB, recommended 64GB+)
- Use faster storage (SSD recommended)
- Use appropriate sizing model (xsmall, small, medium, large)
- Adjust pod resource requests and limits
- Enable resource quotas
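The CPU and memory minimums listed above can be checked with a short script. This is a sketch using a hypothetical helper, `meets_minimums`, with the 24 vCPU and 48 GB figures from the list as defaults. Note that the kernel usually reports slightly less memory than the VM was provisioned with, so a result just under the minimum may still correspond to a correctly sized VM.

```shell
# Compare CPU count and memory against minimums.
# $1 = CPU count, $2 = memory in GiB, $3 = minimum CPUs (default 24), $4 = minimum GiB (default 48)
# Prints one line per shortfall and returns 1 if any minimum is not met.
meets_minimums() {
    cpus=$1
    mem_gib=$2
    min_cpus=${3:-24}
    min_gib=${4:-48}
    status=0
    if [ "$cpus" -lt "$min_cpus" ]; then
        echo "cpu: $cpus vCPUs is below the minimum of $min_cpus"
        status=1
    fi
    if [ "$mem_gib" -lt "$min_gib" ]; then
        echo "memory: $mem_gib GiB is below the minimum of $min_gib GiB"
        status=1
    fi
    return $status
}

# On the node (MemTotal is reported in KiB and rounded down to whole GiB here):
#   meets_minimums "$(nproc)" "$(awk '/MemTotal/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo)"
```

The third and fourth arguments let the same check be reused for the recommended values, for example `meets_minimums "$cpus" "$mem" 32 64`.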
Restart Services
How to restart GCM services without losing data.
Restart Individual Pods
# Delete pod (will be recreated automatically)
kubectl delete pod <pod-name> -n gcm-system
# Restart deployment
kubectl rollout restart deployment/<deployment-name> -n gcm-system
# Restart all GCM deployments
kubectl rollout restart deployment -n gcm-system
Restart K3s Service
# Restart K3s
systemctl restart k3s
# Check K3s status
systemctl status k3s
# Wait for pods to come back
watch kubectl get pods -A
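For unattended use, watching the pod list by hand can be replaced with a polling loop that gives up after a timeout. The sketch below is one possible approach, not part of the GCM scripts: `wait_until` and `all_pods_settled` are hypothetical helpers, and the 600-second timeout in the usage comment is an arbitrary example.

```shell
# Poll a command until it succeeds or the timeout elapses.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@"; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Succeed only when the API server answers and every pod is Running or Completed.
# With -A the columns are: NAMESPACE NAME READY STATUS RESTARTS AGE.
all_pods_settled() {
    pods=$(kubectl get pods -A --no-headers 2>/dev/null) || return 1
    [ -n "$pods" ] || return 1
    printf '%s\n' "$pods" |
        awk '$4 != "Running" && $4 != "Completed" { bad = 1 } END { exit bad }'
}

# After restarting K3s:
#   wait_until 600 10 all_pods_settled && echo "cluster is back" || echo "timed out"
```

Treating a failed `kubectl` call as "not settled" matters here, because the API server is briefly unavailable right after the restart.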
Graceful Shutdown
Properly shut down GCM to prevent data corruption.
Using the Shutdown Script
# Stop GCM (leave system running)
./shutdown_gcm_gracefully.sh
# Stop GCM and shutdown system
./shutdown_gcm_gracefully.sh --shutdown-system
Shutdown Order
The script follows this order to ensure clean shutdown:
- GCM Application Pods (2-3 minutes)
- Middleware Services - Redis, Kafka, MongoDB, PostgreSQL (3-5 minutes)
- Operators (1-2 minutes)
- Remaining Pods (1 minute)
- Cert-Manager (1 minute)
- Rook Ceph (optional, 2-3 minutes)
- K3s (30 seconds)
- System Shutdown (optional)
Manual Shutdown
If the script is not available:
# Scale down GCM deployments
kubectl scale deployment --all --replicas=0 -n gcm-system
# Stop K3s
systemctl stop k3s
# Shutdown system
shutdown -h now
Reset Installation
Start over with a clean installation.
Uninstall K3s
# Uninstall K3s (removes everything)
/usr/local/bin/k3s-uninstall.sh
# Verify removal
systemctl status k3s
# Should show: Unit k3s.service could not be found
Clean Up Remaining Files
# Remove K3s data
rm -rf /var/lib/rancher/k3s
rm -rf /etc/rancher/k3s
# Remove Rook Ceph data (if used)
rm -rf /var/lib/rook
# Clean up configuration
rm -rf /etc/gcm
Reinstall
After cleanup, follow the Quick Start guide to reinstall: