"cause_detailed":"Proxmox VE uses a subscription model for enterprise features. Without a valid subscription key, access to the enterprise repository is denied. This is normal for home/lab users.",
"severity":"info",
"solution":"Use no-subscription repository or purchase subscription",
"solution_detailed":"For home/lab use: Switch to the no-subscription repository by editing /etc/apt/sources.list.d/pve-enterprise.list. For production: Purchase a subscription at proxmox.com/pricing",
"cause_detailed":"Corosync cluster requires more than 50% of configured votes to maintain quorum. When quorum is lost, the cluster becomes read-only to prevent split-brain scenarios.",
"severity":"critical",
"solution":"Check network connectivity between nodes; ensure majority of nodes are online",
"solution_detailed":"1. Verify network connectivity: ping all cluster nodes\n2. Check corosync status: systemctl status corosync\n3. View cluster status: pvecm status\n4. If nodes are unreachable, check firewall rules (ports 5405-5412 UDP)\n5. For emergency single-node operation: pvecm expected 1",
"cause_detailed":"The Corosync QDevice provides an additional vote for 2-node clusters. When it cannot connect, the cluster may lose quorum if one node fails.",
"severity":"warning",
"solution":"Check QDevice server connectivity and corosync-qnetd service",
"solution_detailed":"1. Verify QDevice server is running: systemctl status corosync-qnetd (on QDevice host)\n2. Check connectivity: nc -zv <qdevice-ip> 5403\n3. Restart qdevice: systemctl restart corosync-qdevice\n4. Check certificates: corosync-qdevice-net-certutil -s",
"cause":"Network latency or packet loss between cluster nodes",
"cause_detailed":"Corosync uses multicast/unicast for cluster communication. High latency, packet loss, or network congestion causes token timeouts and retransmissions, potentially leading to node eviction.",
"severity":"warning",
"solution":"Check network quality between nodes; consider increasing token timeout",
"solution_detailed":"1. Test network latency: ping -c 100 <other-node>\n2. Check for packet loss between nodes\n3. Verify MTU settings match on all interfaces\n4. Increase token timeout in /etc/pve/corosync.conf if needed (default 1000ms)\n5. Check switch/router for congestion",
"cause":"Disk SMART health check failed - disk is failing",
"cause_detailed":"SMART (Self-Monitoring, Analysis and Reporting Technology) detected critical disk health issues. The disk is likely failing and data loss is imminent.",
"severity":"critical",
"solution":"IMMEDIATELY backup data and replace disk",
"solution_detailed":"1. URGENT: Backup all data from this disk immediately\n2. Check SMART details: smartctl -a /dev/sdX\n3. Note the failing attributes (Reallocated_Sector_Ct, Current_Pending_Sector, etc.)\n4. Plan disk replacement\n5. If in RAID/ZFS: initiate disk replacement procedure",
"cause":"Disk has excessive bad sectors being remapped",
"cause_detailed":"The disk firmware has remapped multiple bad sectors to spare areas. While the disk is still functioning, this indicates physical degradation and eventual failure.",
"severity":"warning",
"solution":"Monitor closely and plan disk replacement",
"solution_detailed":"1. Check current value: smartctl -A /dev/sdX | grep Reallocated\n2. If value is increasing, plan immediate replacement\n3. Backup important data\n4. Run extended SMART test: smartctl -t long /dev/sdX",
"cause_detailed":"The SATA/ATA controller encountered communication errors with the disk. This can indicate cable issues, controller problems, or disk failure.",
"severity":"warning",
"solution":"Check SATA cables and connections; verify disk health with smartctl",
"solution_detailed":"1. Check SMART health: smartctl -H /dev/sdX\n2. Inspect and reseat SATA cables\n3. Try different SATA port\n4. Check dmesg for pattern of errors\n5. If errors persist, disk may be failing",
"cause_detailed":"The kernel failed to read or write data to the disk. This can be caused by disk failure, cable issues, or filesystem corruption.",
"severity":"critical",
"solution":"Check disk health and connections immediately",
"solution_detailed":"1. Check SMART status: smartctl -H /dev/sdX\n2. Check dmesg for related errors: dmesg | grep -i error\n3. Verify disk is still accessible: lsblk\n4. If ZFS: check pool status with zpool status\n5. Consider filesystem check if safe to unmount",
"cause_detailed":"One or more devices in the ZFS pool are unavailable or experiencing errors. The pool is still functional but without full redundancy.",
"severity":"warning",
"solution":"Identify failed device with 'zpool status' and replace",
"solution_detailed":"1. Check pool status: zpool status <pool>\n2. Identify the DEGRADED or UNAVAIL device\n3. If device is present but erroring: zpool scrub <pool>\n4. To replace: zpool replace <pool> <old-device> <new-device>\n5. Monitor resilver progress: zpool status",
"cause_detailed":"The ZFS pool has lost too many devices and cannot maintain data integrity. Data may be inaccessible.",
"severity":"critical",
"solution":"Check failed devices; may need data recovery",
"solution_detailed":"1. Check status: zpool status <pool>\n2. Identify all failed devices\n3. Attempt to online devices: zpool online <pool> <device>\n4. If drives are physically present, try zpool clear <pool>\n5. May require data recovery if multiple drives failed",
"cause_detailed":"Ceph detected issues that don't prevent operation but should be addressed. Common causes: degraded PGs, clock skew, full OSDs.",
"severity":"warning",
"solution":"Run 'ceph health detail' for specific issues",
"solution_detailed":"1. Get details: ceph health detail\n2. Common fixes:\n - Degraded PGs: wait for recovery or add capacity\n - Clock skew: sync NTP on all nodes\n - Full OSDs: add storage or delete data\n3. Check: ceph status",
"category":"storage"
},
{
"pattern":r"ceph.*health.*ERR|HEALTH_ERR",
"cause":"Ceph cluster has critical errors",
"cause_detailed":"Ceph has detected critical issues that may affect data availability or integrity. Immediate attention required.",
"severity":"critical",
"solution":"Run 'ceph health detail' and address errors immediately",
"solution_detailed":"1. Get details: ceph health detail\n2. Check OSD status: ceph osd tree\n3. Check MON status: ceph mon stat\n4. View PG status: ceph pg stat\n5. Address each error shown in health detail",
"pattern":r"TASK ERROR.*failed to get exclusive lock|lock.*timeout|couldn't acquire lock",
"cause":"Resource is locked by another operation",
"cause_detailed":"Another task is currently holding a lock on this VM/CT. This prevents concurrent modifications that could cause corruption.",
"severity":"info",
"solution":"Wait for other task to complete or check for stuck tasks",
"solution_detailed":"1. Check running tasks: cat /var/log/pve/tasks/active\n2. Wait for task completion\n3. If task is stuck (>1h), check process: ps aux | grep <vmid>\n4. As last resort, remove lock file: rm /var/lock/qemu-server/lock-<vmid>.conf",
"cause":"KVM/hardware virtualization not available",
"cause_detailed":"The CPU's hardware virtualization extensions (Intel VT-x or AMD-V) are either not supported, not enabled in BIOS, or blocked by another hypervisor.",
"severity":"warning",
"solution":"Enable VT-x/AMD-V in BIOS settings",
"solution_detailed":"1. Reboot into BIOS/UEFI\n2. Find Virtualization settings (often in CPU or Advanced section)\n3. Enable Intel VT-x or AMD-V/SVM\n4. Save and reboot\n5. Verify: grep -E 'vmx|svm' /proc/cpuinfo",
"category":"vms"
},
{
"pattern":r"out of memory|OOM.*kill|cannot allocate memory|memory.*exhausted",
"cause":"System or VM ran out of memory",
"cause_detailed":"The Linux OOM (Out Of Memory) killer terminated a process to free memory. This indicates memory pressure from overcommitment or memory leaks.",
"severity":"critical",
"solution":"Increase memory allocation or reduce VM memory usage",
"solution_detailed":"1. Check what was killed: dmesg | grep -i oom\n2. Review memory usage: free -h\n3. Check balloon driver status for VMs\n4. Consider adding swap or RAM\n5. Review VM memory allocations for overcommitment",
"cause_detailed":"One or more physical interfaces in a network bond have lost link. Depending on bond mode, this may reduce bandwidth or affect failover.",
"severity":"warning",
"solution":"Check physical cable connections and switch ports",
"solution_detailed":"1. Check bond status: cat /proc/net/bonding/bond0\n2. Identify down slave interface\n3. Check physical cable connection\n4. Check switch port status and errors\n5. Verify interface: ethtool <slave-iface>",
"cause_detailed":"The physical or virtual network interface has lost its connection. This could be a cable issue, switch problem, or driver issue.",
"severity":"warning",
"solution":"Check cable, switch port, and interface status",
"solution_detailed":"1. Check interface: ip link show <iface>\n2. Check cable connection\n3. Check switch port LEDs\n4. Try: ip link set <iface> down && ip link set <iface> up\n5. Check driver: ethtool -i <iface>",
"cause_detailed":"STP detected a potential network loop and blocked a bridge port to prevent broadcast storms. This is normal behavior but may indicate network topology issues.",
"severity":"info",
"solution":"Review network topology; this may be expected behavior",
"solution_detailed":"1. Check bridge status: brctl show\n2. View STP state: brctl showstp <bridge>\n3. If unexpected, review network topology for loops\n4. Consider disabling STP if network is simple: brctl stp <bridge> off",
"cause_detailed":"A scheduled or manual backup operation failed. Common causes: storage full, VM locked, network issues for remote storage.",
"severity":"warning",
"solution":"Check backup storage space and VM status",
"solution_detailed":"1. Check backup log in Datacenter > Backup\n2. Verify storage space: df -h\n3. Check if VM is locked: qm list or pct list\n4. Verify backup storage is accessible\n5. Try manual backup to identify specific error",
"cause_detailed":"An SSL certificate used for secure communication has passed its expiration date. This may cause connection failures or security warnings.",
"severity":"warning",
"solution":"Renew the certificate using pvenode cert set or Let's Encrypt",
"solution_detailed":"1. Check certificate: pvenode cert info\n2. For self-signed renewal: pvecm updatecerts\n3. For Let's Encrypt: pvenode acme cert order\n4. Restart pveproxy after renewal: systemctl restart pveproxy",
"cause_detailed":"A hardware component (CPU, disk, etc.) has reached a dangerous temperature. Sustained high temperatures can cause hardware damage or system shutdowns.",
"severity":"critical",
"solution":"Check cooling system immediately; clean dust, verify fans",
"solution_detailed":"1. Check current temps: sensors\n2. Verify all fans are running\n3. Clean dust from heatsinks and filters\n4. Ensure adequate airflow\n5. Consider reapplying thermal paste if CPU\n6. Check ambient room temperature",
"cause_detailed":"A login attempt failed due to invalid credentials or permissions. Multiple failures may indicate a brute-force attack.",
"severity":"info",
"solution":"Verify credentials; check for unauthorized access attempts",
"solution_detailed":"1. Review auth logs: journalctl -u pvedaemon | grep auth\n2. Check for multiple failures from same IP\n3. Verify user exists: pveum user list\n4. If attack suspected, consider fail2ban\n5. Reset password if needed: pveum passwd <user>",