Troubleshooting Guide - IBM GCM 2.0.1

System Health Check

Quick commands to verify overall system health.

Check All Pods

bash

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get pods -A

Check GCM Pods

bash

kubectl get pods -n gcm-system

Check Services

bash

kubectl get svc -n gcm-system

Check Storage

bash

kubectl get pvc -A
kubectl get storageclass

Check Nodes

bash

kubectl get nodes -o wide

Check Events

bash

kubectl get events -n gcm-system --sort-by='.lastTimestamp'

Expected Healthy State

All GCM pods in Running state

All PVCs in Bound state

Node status is Ready

Services have endpoints assigned

No recent error events

Pod restart count is 0 or low
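
The pod-related items in this checklist can be partly automated. The following is a minimal sketch, not part of the GCM tooling: unhealthy_pods is a hypothetical helper name, and the restart threshold of 5 is an assumed cutoff for "low". It reads the output of kubectl get pods --no-headers and prints every pod that is not Running or Completed, or that has restarted more often than the threshold.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list pods that fall outside the expected healthy state.
# Reads `kubectl get pods -n <namespace> --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
# $1 = highest restart count still treated as "low" (assumed default: 5).
unhealthy_pods() {
    local max_restarts=${1:-5}
    awk -v max="$max_restarts" '
        # Flag pods in any state other than Running/Completed,
        # and pods that restarted more often than the threshold.
        ($3 != "Running" && $3 != "Completed") || ($4 + 0 > max) {
            print $1, $3, $4
        }'
}

# Usage:
#   kubectl get pods -n gcm-system --no-headers | unhealthy_pods 5
```

An empty result means every pod passed both checks; PVCs, nodes, and events still need to be checked separately.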

Diagnostic Commands

Detailed commands for investigating specific components.

Pod Diagnostics

bash

# Get detailed pod information
# (replace <pod-name> with a name from: kubectl get pods -n gcm-system)
kubectl describe pod <pod-name> -n gcm-system

# Check pod logs
kubectl logs <pod-name> -n gcm-system

# Follow logs in real time
kubectl logs -f <pod-name> -n gcm-system

# Get logs from the previous container instance (if it crashed)
kubectl logs <pod-name> -n gcm-system --previous

# Check pod resource usage
kubectl top pod <pod-name> -n gcm-system

Storage Diagnostics

bash

# Check PVC status
kubectl get pvc -n gcm-system

# Describe PVC for details
kubectl describe pvc <pvc-name> -n gcm-system

# Check storage classes
kubectl get storageclass

# Check Ceph cluster health (if using Rook Ceph)
kubectl get cephcluster -n rook-ceph
kubectl get pods -n rook-ceph

# Check Ceph status
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph status

Network Diagnostics

bash

# Check service endpoints
kubectl get endpoints -n gcm-system

# Test service connectivity
kubectl run test-pod --rm -it --image=busybox -- sh
# Inside pod: wget -O- http://service-name.gcm-system.svc.cluster.local

# Check DNS resolution
kubectl run test-dns --rm -it --image=busybox -- nslookup gcm-service.gcm-system.svc.cluster.local

# Check firewall rules
firewall-cmd --list-all

# Test external connectivity
curl -k https://localhost:31443
curl -k https://localhost:30443

System Resource Diagnostics

bash

# Check node resources
kubectl top nodes

# Check pod resources
kubectl top pods -n gcm-system

# Check disk usage
df -h

# Check memory usage
free -h

# Check CPU usage
top -bn1 | head -20

# Check system logs
journalctl -u k3s -n 100 --no-pager

Log Analysis

How to find and analyze logs for troubleshooting.

GCM Application Logs

bash

# Get all GCM pod logs
kubectl logs -n gcm-system -l app=gcm --tail=100

# Search for errors
kubectl logs -n gcm-system -l app=gcm --tail=500 | grep -i error

# Search for warnings
kubectl logs -n gcm-system -l app=gcm --tail=500 | grep -i warning

# Get logs from the last hour for a specific pod
kubectl logs -n gcm-system <pod-name> --since=1h

# Save a pod's logs to a file
kubectl logs -n gcm-system <pod-name> > gcm-pod.log
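
When the grep output is long, a quick count can show whether errors or warnings dominate. This is a minimal sketch under stated assumptions: count_log_levels is a hypothetical helper name, and it assumes that log lines contain the words "error" or "warn" in some letter case, which may not match every GCM component's log format.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count log lines mentioning errors or warnings.
# Reads log text on stdin; matching is case-insensitive, and "warn"
# also matches "warning" and "WARN".
count_log_levels() {
    awk '
        { line = tolower($0) }
        line ~ /error/ { errors++ }
        line ~ /warn/  { warnings++ }
        END { printf "errors=%d warnings=%d\n", errors, warnings }'
}

# Usage:
#   kubectl logs -n gcm-system -l app=gcm --tail=500 | count_log_levels
```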

System Logs

bash

# K3s service logs
journalctl -u k3s -n 200 --no-pager

# Follow K3s logs
journalctl -u k3s -f

# System messages
tail -f /var/log/messages

# Check for OOM kills
dmesg | grep -i 'out of memory'

# Check for kernel errors, including disk I/O errors
dmesg | grep -i 'error'

Installation Logs

bash

# View installation logs
ls -lh /root/*install*.log

# Check K3s installation log
cat /root/k3s_install.log

# Check GCM installation log
cat /root/gcm_install.log

# Search for errors in installation
grep -i error /root/*install*.log

Issue: Network Configuration Errors

Troubleshoot errors when running the VM configuration wizard (01-configure_vm.sh).

Error: "method 'manual' requires at least an address or a route"

Symptom: NetworkManager fails to apply the static IP configuration and reports that an address or route is missing.

Root Cause

This error occurs when:

Subnet mask is not properly converted to CIDR notation
NetworkManager connection name doesn't match the interface
IP address format is incorrect

Solution

The latest version of 01-configure_vm.sh (v1.0.0+) includes fixes for this issue:

✅ Automatic subnet mask to CIDR conversion
✅ Improved connection detection with multiple fallback methods
✅ Better error messages showing available connections

bash

# Download latest version
curl -O http://acefs01.ace.ibm.aessatl.arrow.com/downloads/gcm_2_0_1/01-configure_vm.sh
chmod +x 01-configure_vm.sh

# Re-run configuration
./01-configure_vm.sh

Manual Configuration (if script fails)

If the script continues to fail, configure manually:

bash

# List available connections
nmcli connection show

# Identify your connection (usually matches interface name like 'ens33')
CONNECTION_NAME='ens33'  # or 'System ens33', etc.

# Configure network (replace with your values)
nmcli connection modify "$CONNECTION_NAME" \
    ipv4.method manual \
    ipv4.addresses "172.20.28.202/24" \
    ipv4.gateway "172.20.28.1" \
    ipv4.dns "172.20.12.100"

# Bring up connection
nmcli connection up "$CONNECTION_NAME"

Subnet Mask to CIDR Conversion

Common subnet mask conversions:

Subnet Mask	CIDR	Usable IPs
255.255.255.0	/24	254
255.255.254.0	/23	510
255.255.252.0	/22	1,022
255.255.0.0	/16	65,534
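
The conversion in this table can be reproduced with a short shell function. This is a minimal sketch and not the implementation used by 01-configure_vm.sh; mask_to_cidr and usable_hosts are hypothetical helper names. The prefix length is the number of leading 1 bits in the mask, and the usable host count is 2^(32 - prefix) - 2, which excludes the network and broadcast addresses.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a dotted-decimal subnet mask to a CIDR prefix.
# Prints the prefix length, or returns 1 for a malformed or non-contiguous mask.
mask_to_cidr() {
    local octets octet bits=0 add tail=0
    IFS=. read -r -a octets <<< "$1"
    if [ "${#octets[@]}" -ne 4 ]; then
        echo "invalid mask: $1" >&2
        return 1
    fi
    for octet in "${octets[@]}"; do
        case $octet in
            255) add=8 ;; 254) add=7 ;; 252) add=6 ;; 248) add=5 ;;
            240) add=4 ;; 224) add=3 ;; 192) add=2 ;; 128) add=1 ;;
            0)   add=0 ;;
            *)   echo "invalid mask octet: $octet" >&2; return 1 ;;
        esac
        # After the first octet that is not 255, every later octet must be 0.
        if [ "$tail" -eq 1 ] && [ "$add" -ne 0 ]; then
            echo "non-contiguous mask: $1" >&2
            return 1
        fi
        if [ "$add" -lt 8 ]; then
            tail=1
        fi
        bits=$((bits + add))
    done
    echo "$bits"
}

# Usable host addresses for a prefix length of 30 or shorter.
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}

# Usage:
#   mask_to_cidr 255.255.254.0
#   usable_hosts 23
```

The resulting prefix is the value appended to the address in ipv4.addresses, as in 172.20.28.202/24.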

Connection Detection Issues

Cannot find NetworkManager connection

Diagnosis:

bash

# List all connections
nmcli connection show

# Check active connections
nmcli connection show --active

# Check network interfaces
ip link show

Solution: Use the exact connection name shown in the nmcli connection show output, including any prefix (for example 'System ens33'), and quote it if it contains spaces.

Tip: The updated script automatically tries multiple detection methods and falls back to using the interface name directly if needed.
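
One way to look up the connection name for a given interface is sketched below. It is an illustration only and does not reproduce the detection logic in 01-configure_vm.sh; connection_for_device is a hypothetical helper name. It relies on nmcli terse output (-t) with the NAME and DEVICE fields, which prints one NAME:DEVICE pair per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the NetworkManager connection bound to a device.
# Reads `nmcli -t -f NAME,DEVICE connection show` output on stdin, where each
# line is NAME:DEVICE. $1 = interface name, for example ens33.
connection_for_device() {
    awk -F: -v dev="$1" '
        $NF == dev {
            sub(/:[^:]*$/, "")   # drop the trailing :DEVICE field
            print
            exit
        }'
}

# Usage (falls back to the interface name when no connection matches):
#   name=$(nmcli -t -f NAME,DEVICE connection show | connection_for_device ens33)
#   CONNECTION_NAME=${name:-ens33}
```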

Issue: Pods Not Starting

Troubleshoot pods stuck in Pending, CrashLoopBackOff, or Error states.

Symptoms

Pods stuck in Pending state
Pods in CrashLoopBackOff
Pods showing Error or ImagePullBackOff
High restart count on pods

Diagnosis

bash

# Check pod status
kubectl get pods -n gcm-system

# Get detailed pod information
kubectl describe pod <pod-name> -n gcm-system

# Check events for a specific pod
kubectl get events -n gcm-system --field-selector involvedObject.name=<pod-name>

# Check logs from the previous container instance
kubectl logs <pod-name> -n gcm-system --previous

Common Causes & Solutions

Insufficient Resources

Pod cannot be scheduled due to insufficient CPU or memory.

bash

# Check node resources
kubectl top nodes
kubectl describe node

# Solution: Increase VM resources or reduce pod requests
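
To spot a node that is close to its limits, the kubectl top nodes output can be filtered by percentage. This is a minimal sketch: busy_nodes is a hypothetical helper name and the 80 percent threshold is an assumed value, not a GCM recommendation. It also assumes that the metrics API is available, since kubectl top depends on it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list nodes whose CPU or memory usage exceeds a threshold.
# Reads `kubectl top nodes --no-headers` output on stdin
# (columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%).
# $1 = threshold in percent (assumed default: 80).
busy_nodes() {
    local limit=${1:-80}
    awk -v limit="$limit" '
        # "$3 + 0" turns a value such as "91%" into the number 91.
        ($3 + 0 > limit) || ($5 + 0 > limit) { print $1, $3, $5 }'
}

# Usage:
#   kubectl top nodes --no-headers | busy_nodes 80
```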

Storage Issues

PVC not bound or storage class unavailable.

bash

# Check PVC status
kubectl get pvc -n gcm-system

# Check storage classes
kubectl get storageclass

# Solution: Ensure storage classes exist and are available

Image Pull Errors

Cannot pull container image from registry.

bash

# Check image pull secrets
kubectl get secrets -n gcm-system

# Test outbound connectivity to a public container registry
curl -I https://registry.k8s.io

# Solution: Check network connectivity and registry credentials

Issue: Storage Problems

Troubleshoot PVC binding issues and storage class problems.

Symptoms

PVCs stuck in Pending state
Storage class not found errors
Ceph cluster not healthy
Disk space issues

Check Storage Classes

bash

# List storage classes
kubectl get storageclass

# Expected output should show:
# - rook-cephfs (for filesystem storage)
# - rook-ceph-block (for block storage)
# OR
# - local-path (for local storage)

Critical: GCM requires BOTH rook-cephfs and rook-ceph-block storage classes when using Ceph, or local-path for simple deployments.
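
The rule above can be checked mechanically. The following sketch uses a hypothetical helper name, storage_mode, and reads the output of kubectl get storageclass -o name, which prints one entry per line in the form storageclass.storage.k8s.io/<name>. It only checks that the classes exist, not that their provisioners are healthy.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which supported storage setup is present.
# Reads `kubectl get storageclass -o name` output on stdin.
# Prints "ceph" when both Ceph classes exist, "local-path" when only the
# local-path class exists, and "missing" (exit status 1) otherwise.
storage_mode() {
    local names
    names=$(sed 's|.*/||')    # keep only the class name after the slash
    if grep -qx 'rook-cephfs' <<< "$names" &&
       grep -qx 'rook-ceph-block' <<< "$names"; then
        echo "ceph"
    elif grep -qx 'local-path' <<< "$names"; then
        echo "local-path"
    else
        echo "missing"
        return 1
    fi
}

# Usage:
#   kubectl get storageclass -o name | storage_mode
```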

Check Ceph Health (if using Rook Ceph)

bash

# Check Ceph cluster
kubectl get cephcluster -n rook-ceph

# Check Ceph pods
kubectl get pods -n rook-ceph

# Check Ceph status
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph status

# Check Ceph OSD status
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd status
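
For scripted checks, the overall health state can be pulled out of the ceph status output. This is a minimal sketch with a hypothetical helper name, ceph_health; it assumes the plain-text status layout in which the cluster section contains a line starting with "health:".

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the health state from `ceph status` output.
# Reads the plain-text status on stdin and prints the value of the
# "health:" line, for example HEALTH_OK, HEALTH_WARN, or HEALTH_ERR.
ceph_health() {
    awk '$1 == "health:" { print $2; exit }'
}

# Usage:
#   kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph status | ceph_health
```

Any value other than HEALTH_OK is a reason to read the full status output, which lists the individual warnings under the health line.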

Check Disk Space

bash

# Check disk usage
df -h

# Check /var partition (K3s uses this)
df -h /var

# Check for large files
du -sh /var/* | sort -h

# Clean up if needed
kubectl delete pods --field-selector status.phase=Succeeded -A
kubectl delete pods --field-selector status.phase=Failed -A
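
To flag filesystems that are running out of space, the df output can be filtered by usage percentage. This is a minimal sketch: full_filesystems is a hypothetical helper name and the 85 percent threshold is an assumed value. It expects the POSIX output format produced by df -P and does not handle mount points that contain spaces.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list filesystems at or above a usage threshold.
# Reads `df -P` output on stdin (header line included):
#   Filesystem 1024-blocks Used Available Capacity Mounted on
# $1 = threshold in percent (assumed default: 85).
full_filesystems() {
    local limit=${1:-85}
    # Skip the header; "$5 + 0" turns "90%" into the number 90.
    awk -v limit="$limit" 'NR > 1 && $5 + 0 >= limit { print $6, $5 }'
}

# Usage:
#   df -P | full_filesystems 85
```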

Solution: Recreate Storage Classes

If storage classes are missing, you may need to reinstall K3s with proper storage configuration.

bash

# For local-path storage
./install_singlenode_k3s_localpath.sh

# For Ceph storage (requires second disk)
./install_singlenode_k3s_ceph.sh

Issue: Network Connectivity Problems

Troubleshoot network connectivity and DNS resolution issues.

Symptoms

Cannot access GCM web interface
Pods cannot communicate with each other
DNS resolution failures
Timeout errors

Check Firewall

bash

# Check firewall status
firewall-cmd --state

# List open ports
firewall-cmd --list-all

# Open required ports if needed
firewall-cmd --permanent --add-port=30443/tcp
firewall-cmd --permanent --add-port=31443/tcp
firewall-cmd --reload
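
Before opening ports, it can help to confirm which ones are actually missing. The sketch below uses a hypothetical helper name, missing_ports, and reads the output of firewall-cmd --list-ports, which prints the open ports of the default zone as space-separated entries such as 30443/tcp. It does not account for port ranges or for ports opened through a firewalld service definition.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report required ports that firewalld has not opened.
# Reads `firewall-cmd --list-ports` output on stdin.
# Arguments are the port/protocol entries that must be present.
missing_ports() {
    local open port
    open=" $(tr '\n' ' ') "          # pad with spaces for whole-word matching
    for port in "$@"; do
        case $open in
            *" $port "*) ;;          # already open
            *) echo "$port" ;;       # not listed
        esac
    done
}

# Usage:
#   firewall-cmd --list-ports | missing_ports 30443/tcp 31443/tcp
```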

Test Connectivity

bash

# Test local access
curl -k https://localhost:31443
curl -k https://localhost:30443

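# Test from another machine (replace <server-ip> with the GCM host address)
curl -k https://<server-ip>:31443

# Check if ports are listening
ss -tlnp | grep -E '30443|31443'

# Test DNS resolution and reachability (replace <hostname> with the GCM host name)
nslookup <hostname>
ping -c 4 <hostname>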

Check Service Endpoints

bash

# Check services
kubectl get svc -n gcm-system

# Check endpoints
kubectl get endpoints -n gcm-system

# Describe service for details
kubectl describe svc <service-name> -n gcm-system

DNS Issues

bash

# Check DNS configuration
cat /etc/resolv.conf

# Test DNS resolution (replace <hostname> with the GCM host name)
nslookup google.com
nslookup <hostname>

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Restart CoreDNS if needed
kubectl rollout restart deployment/coredns -n kube-system

Issue: OIDC Authentication Errors

Troubleshoot OIDC provider and authentication issues.

Symptoms

Cannot login to GCM
OIDC redirect errors
"Invalid redirect URI" errors
Certificate errors

Check OIDC Service

bash

# Check OIDC pods
kubectl get pods -n gcm-system | grep oidc

# Check OIDC service
kubectl get svc -n gcm-system | grep oidc

# Test OIDC endpoint
curl -k https://localhost:30443/.well-known/openid-configuration

Fix OIDC FQDN Issues

If you're getting redirect URI errors, the OIDC provider may be using the wrong hostname.

bash

# Use the fix script
./fix_oidc_fqdn.sh

# Or manually update OIDC configuration
kubectl edit configmap oidc-config -n gcm-system
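
To confirm that a hostname mismatch is the cause, compare the issuer advertised by the discovery endpoint with the hostname used in the browser. The issuer field is part of every OpenID Connect discovery document; that a wrong FQDN shows up there in GCM is an assumption. The helper name oidc_issuer is hypothetical, and the sketch expects the "issuer" key and its value to be on the same line of the JSON.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the "issuer" value from an OIDC discovery
# document. Reads JSON on stdin; uses grep and sed so that no JSON tool
# is required.
oidc_issuer() {
    grep -o '"issuer"[[:space:]]*:[[:space:]]*"[^"]*"' |
        head -n 1 |
        sed 's/.*"\([^"]*\)"$/\1/'
}

# Usage:
#   curl -sk https://localhost:30443/.well-known/openid-configuration | oidc_issuer
```

If the printed issuer uses a different hostname from the one in the login URL, redirect URI errors are the expected result.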

Check Certificates

bash

# Check certificates
kubectl get certificates -n gcm-system

# Check certificate details
kubectl describe certificate <certificate-name> -n gcm-system

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager

Issue: Performance Problems

Troubleshoot slow response times and resource constraints.

Symptoms

Slow web interface
High CPU or memory usage
Pods being OOM killed
Disk I/O bottlenecks

Check Resource Usage

bash

# Check node resources
kubectl top nodes

# Check pod resources
kubectl top pods -n gcm-system

# Check for resource limits
kubectl describe pod <pod-name> -n gcm-system | grep -A 5 Limits

# Check for OOM kills
dmesg | grep -i 'out of memory'

Check Disk I/O

bash

# Check disk usage
df -h

# Check I/O statistics
iostat -x 1 5

# Check for slow disks
dmesg | grep -i 'slow'

# If using Ceph, check OSD performance
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd perf

Solutions

Increase VM Resources:

Add more vCPUs (minimum 24, recommended 32+)
Increase RAM (minimum 48GB, recommended 64GB+)
Use faster storage (SSD recommended)

Optimize Configuration:

Use appropriate sizing model (xsmall, small, medium, large)
Adjust pod resource requests and limits
Enable resource quotas

Restart Services

How to restart GCM services without losing data.

Restart Individual Pods

bash

# Delete a pod (its controller recreates it automatically)
kubectl delete pod <pod-name> -n gcm-system

# Restart a single deployment
kubectl rollout restart deployment/<deployment-name> -n gcm-system

# Restart all GCM deployments
kubectl rollout restart deployment -n gcm-system

Restart K3s Service

bash

# Restart K3s
systemctl restart k3s

# Check K3s status
systemctl status k3s

# Wait for pods to come back
watch kubectl get pods -A

Warning: Restarting K3s will temporarily disrupt all services. Use this only when necessary.

Graceful Shutdown

Properly shut down GCM to prevent data corruption.

Using the Shutdown Script

bash

# Stop GCM (leave system running)
./shutdown_gcm_gracefully.sh

# Stop GCM and shutdown system
./shutdown_gcm_gracefully.sh --shutdown-system

Shutdown Order

The script follows this order to ensure clean shutdown:

GCM Application Pods (2-3 minutes)
Middleware Services - Redis, Kafka, MongoDB, PostgreSQL (3-5 minutes)
Operators (1-2 minutes)
Remaining Pods (1 minute)
Cert-Manager (1 minute)
Rook Ceph (optional, 2-3 minutes)
K3s (30 seconds)
System Shutdown (optional)

Total Time: 8-15 minutes depending on options selected.
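
The quoted total can be cross-checked against the per-step estimates. The following worked sum is only an arithmetic illustration, not part of the shutdown script; total_shutdown_time is a hypothetical helper name. It adds the minimum and maximum of each step in seconds and includes the Rook Ceph step only on request.

```bash
#!/usr/bin/env bash
# Worked sum of the per-step shutdown estimates, in seconds ("min max").
# $1 = "yes" to include the optional Rook Ceph step.
total_shutdown_time() {
    local with_ceph=$1 min=0 max=0 lo hi step
    # GCM pods, middleware, operators, remaining pods, cert-manager, K3s
    local steps=("120 180" "180 300" "60 120" "60 60" "60 60" "30 30")
    if [ "$with_ceph" = "yes" ]; then
        steps+=("120 180")               # Rook Ceph (optional)
    fi
    for step in "${steps[@]}"; do
        read -r lo hi <<< "$step"
        min=$((min + lo))
        max=$((max + hi))
    done
    printf '%dm%02ds to %dm%02ds\n' \
        $((min / 60)) $((min % 60)) $((max / 60)) $((max % 60))
}

total_shutdown_time no     # prints: 8m30s to 12m30s
total_shutdown_time yes    # prints: 10m30s to 15m30s
```

The two results together span about 8.5 to 15.5 minutes, which matches the stated range once rounded.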

Manual Shutdown

If the script is not available:

bash

# Scale down GCM deployments
kubectl scale deployment --all --replicas=0 -n gcm-system

# Stop K3s
systemctl stop k3s

# Shutdown system
shutdown -h now

Reset Installation

Start over with a clean installation.

Warning: This will DELETE ALL DATA. Make backups before proceeding!

Uninstall K3s

bash

# Uninstall K3s (removes everything)
/usr/local/bin/k3s-uninstall.sh

# Verify removal
systemctl status k3s
# Should show: Unit k3s.service could not be found

Clean Up Remaining Files

bash

# Remove K3s data
rm -rf /var/lib/rancher/k3s
rm -rf /etc/rancher/k3s

# Remove Rook Ceph data (if used)
rm -rf /var/lib/rook

# Clean up configuration
rm -rf /etc/gcm

Reinstall

After cleanup, follow the Quick Start guide to reinstall: