ProxMenux/AppImage/ProxMenux-Monitor.AppImage.sha256

2 lines
65 B
Plaintext

Health Monitor: reconcile stale disk warnings across reboots

When a host gets transient I/O events on a disk while smartctl is momentarily unavailable (the canonical case: late in a noisy shutdown), the disk-scan code records a `disk_<name>` WARNING tagged "SMART: unavailable" exactly once and trusts the next scan to clear it.

That trust is misplaced: the clear path only fires when the device shows up in the current dmesg window with zero events. After a reboot, dmesg is empty for that device, so the device never gets iterated, resolve_error is never called, and the dashboard stays orange for a disk whose SMART now reports PASSED.

Caught on a lab host where `disk_nvme2n1` had been stuck as WARNING for hours after a reboot. SMART was 100% healthy at the moment of inspection (Critical Warning 0x00, 0 media errors, 100% spare). The error's first_seen and last_seen were identical and pre-dated the current boot, confirming a one-shot record that nothing had cleared.

Fix: add a `_reconcile_stale_disk_warnings()` pass at the top of `_check_disks_optimized()`. For every active `disk_*` error (skipping `disk_fs_*`, which is already reconciled separately):

- device gone from /dev/ → resolve "Device no longer present"
- device present + SMART PASSED → resolve "Transient I/O cleared, SMART now reports healthy"
- device present + SMART UNKNOWN/FAILED → leave active so the main loop can re-classify on the next dmesg window

Acknowledged errors are left alone so the user's explicit dismiss intent isn't overridden.

Verified end-to-end: re-injected the original `disk_nvme2n1` warning into the persistence DB on the lab host, waited one scan cycle, error was resolved automatically with `resolved_at` set and `resolution_reason = 'Transient I/O cleared, SMART now reports healthy'`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
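
The reconcile rule described in the commit message can be sketched roughly as follows. This is an illustration only, not the code in the commit: `ActiveError`, `reconcile_stale_disk_warnings`, and the injected `device_exists` / `smart_status` callables are hypothetical names, and the real `_reconcile_stale_disk_warnings()` presumably reads from the persistence DB and calls `resolve_error` itself rather than returning a dict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class ActiveError:
    """Hypothetical stand-in for one active row in the persistence DB."""
    key: str                    # e.g. "disk_nvme2n1"
    acknowledged: bool = False  # True if the user explicitly dismissed it


def reconcile_stale_disk_warnings(
    errors: Iterable[ActiveError],
    device_exists: Callable[[str], bool],  # assumed: checks /dev/<name>
    smart_status: Callable[[str], str],    # assumed: "PASSED", "FAILED" or "UNKNOWN"
) -> Dict[str, str]:
    """Return {error_key: resolution_reason} for errors that should be resolved."""
    to_resolve: Dict[str, str] = {}
    for err in errors:
        # Only per-device disk errors; disk_fs_* is reconciled separately.
        if not err.key.startswith("disk_") or err.key.startswith("disk_fs_"):
            continue
        # Respect the user's explicit dismiss intent.
        if err.acknowledged:
            continue

        name = err.key[len("disk_"):]
        if not device_exists(name):
            to_resolve[err.key] = "Device no longer present"
        elif smart_status(name) == "PASSED":
            to_resolve[err.key] = (
                "Transient I/O cleared, SMART now reports healthy"
            )
        # SMART UNKNOWN/FAILED: leave active so the main loop can
        # re-classify it on the next dmesg window.
    return to_resolve
```

Returning the decisions instead of resolving in place is a choice made here so the rule can be exercised without a database or real devices; the caller would then pass each entry to whatever resolve routine the monitor uses.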
2026-06-01 22:54:14 +02:00
dc0f267de13f20ba7035df442ddaa22bea6e4d26cbaa542a5246ba88c796ffad