Health Monitor: reconcile stale disk warnings across reboots

When a host gets transient I/O events on a disk while smartctl is momentarily unavailable (the canonical case: late in a noisy shutdown), the disk-scan code records a `disk_<name>` WARNING tagged "SMART: unavailable" exactly once and trusts the next scan to clear it. That trust is misplaced: the clear path only fires when the device shows up in the current dmesg window with zero events. After a reboot, dmesg is empty for that device, so the device never gets iterated, resolve_error is never called, and the dashboard stays orange for a disk whose SMART now reports PASSED.

Caught on a lab host where `disk_nvme2n1` had been stuck as WARNING for hours after a reboot. SMART was 100% healthy at the moment of inspection (Critical Warning 0x00, 0 media errors, 100% spare). The error's first_seen and last_seen were identical and pre-dated the current boot, confirming a one-shot record that nothing had cleared.

Fix: add a `_reconcile_stale_disk_warnings()` pass at the top of `_check_disks_optimized()`. For every active `disk_*` error (skipping `disk_fs_*`, which is already reconciled separately):

- device gone from /dev/ → resolve "Device no longer present"
- device present + SMART PASSED → resolve "Transient I/O cleared, SMART now reports healthy"
- device present + SMART UNKNOWN/FAILED → leave active so the main loop can re-classify on the next dmesg window

Acknowledged errors are left alone so the user's explicit dismiss intent isn't overridden.

Verified end-to-end: re-injected the original `disk_nvme2n1` warning into the persistence DB on the lab host, waited one scan cycle, error was resolved automatically with `resolved_at` set and `resolution_reason = 'Transient I/O cleared, SMART now reports healthy'`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
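
The resolution rules in the commit message above can be read as a small decision table. The following is a minimal sketch of that table as a pure function, not the code from the commit: `reconcile_decision` and its parameters are hypothetical names introduced for illustration, and the real pass checks `/dev/` and SMART itself rather than receiving them as arguments.

```python
from typing import Optional

# Reason strings as quoted in the commit message.
REASON_GONE = "Device no longer present"
REASON_HEALTHY = "Transient I/O cleared, SMART now reports healthy"


def reconcile_decision(
    error_key: str,
    acknowledged: bool,
    device_present: bool,
    smart_health: str,
) -> Optional[str]:
    """Return a resolution reason, or None to leave the error active.

    Hypothetical helper: it only mirrors the rules described in the
    commit message so they can be checked in isolation.
    """
    # Only plain disk_<name> errors are in scope; disk_fs_* errors are
    # reconciled by a separate code path.
    if not error_key.startswith("disk_") or error_key.startswith("disk_fs_"):
        return None
    # An acknowledged error reflects an explicit user decision.
    if acknowledged:
        return None
    if not device_present:
        return REASON_GONE
    if smart_health == "PASSED":
        return REASON_HEALTHY
    # UNKNOWN or FAILED: leave it for the main scan loop to classify.
    return None
```

Keeping the decision separate from the I/O (device lookup, SMART query, persistence call) is one way to make each rule checkable without a real disk or database.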
2026-06-11 11:06:24 +00:00 · 2026-06-01 22:54:14 +02:00
parent d25faedc2b
commit 9677c5cb19
3 changed files with 87 additions and 3 deletions
--- a/AppImage/ProxMenux-1.2.2.AppImage
+++ b/AppImage/ProxMenux-1.2.2.AppImage
--- a/AppImage/ProxMenux-Monitor.AppImage.sha256
+++ b/AppImage/ProxMenux-Monitor.AppImage.sha256
@@ -1 +1 @@
-4602b8d4aa130c6f3cb017358d8459b473a5f05d64152fe13200b241932a73a8
+dc0f267de13f20ba7035df442ddaa22bea6e4d26cbaa542a5246ba88c796ffad
--- a/AppImage/scripts/health_monitor.py
+++ b/AppImage/scripts/health_monitor.py
@@ -2361,18 +2361,102 @@ class HealthMonitor:
        except Exception:
            return fallback
    def _reconcile_stale_disk_warnings(self) -> None:
        """
        Reconcile persisted disk_<name> warnings against the current host
        state before each disk scan.
        The disk-scan loop only resolves an error when the device appears
        in the current dmesg window with zero events. After a reboot,
        dmesg is empty for that device, so the loop never iterates it,
        and a `disk_<name>` WARNING recorded as "SMART: unavailable"
        during a noisy shutdown can stay active forever — the dashboard
        keeps showing an orange "Warning" badge for a disk whose SMART
        is in fact PASSED.
        This pass walks the active disk_* errors (skipping disk_fs_*,
        which is already reconciled separately below) and:
          - device gone from /dev/ → resolve as "Device no longer present
            in system"
          - device present + SMART now PASSED → resolve as "Transient
            I/O cleared, SMART now reports healthy"
          - device present + SMART still unavailable → leave warning
            active (the original condition is still ambiguous)
          - device present + SMART FAILED → leave warning active (the
            main loop will pick it up and may upgrade to CRITICAL)
        """
        try:
            active = health_persistence.get_active_errors(category='disks')
        except Exception:
            return
        for err in active:
            err_key = err.get('error_key', '') or ''
            # Skip the filesystem-mount errors — the dedicated block
            # below handles them with its own reconciliation rules.
            if not err_key.startswith('disk_') or err_key.startswith('disk_fs_'):
                continue
            # Don't disturb errors the user explicitly acknowledged.
            if err.get('acknowledged') == 1:
                continue
            details = err.get('details', {})
            if isinstance(details, str):
                try:
                    details = json.loads(details)
                except Exception:
                    details = {}
            # Recover the block device name. Prefer the structured
            # `block_device` field; fall back to `disk` or derive from
            # the error_key (`disk_nvme2n1` → `nvme2n1`).
            base_disk = (
                details.get('block_device') or
                details.get('disk') or
                err_key[len('disk_'):]
            )
            if not base_disk:
                continue
            dev_path = f'/dev/{base_disk}'
            if not os.path.exists(dev_path):
                try:
                    health_persistence.resolve_error(
                        err_key, 'Device no longer present in system')
                except Exception:
                    pass
                continue
            # Device exists — query SMART. _quick_smart_health returns
            # 'PASSED' / 'FAILED' / 'UNKNOWN'.
            try:
                smart_health = self._quick_smart_health(base_disk)
            except Exception:
                smart_health = 'UNKNOWN'
            if smart_health == 'PASSED':
                try:
                    health_persistence.resolve_error(
                        err_key,
                        'Transient I/O cleared, SMART now reports healthy')
                except Exception:
                    pass
            # else: smart UNKNOWN or FAILED — leave active and let the
            # main loop classify it on the next dmesg window.
    def _check_disks_optimized(self) -> Dict[str, Any]:
        """
        Disk I/O error check -- the SINGLE source of truth for disk errors.
-        
+
        Reads dmesg for I/O/ATA/SCSI errors, counts per device, records in
        health_persistence, and returns status for the health dashboard.
        Resolves ATA controller names (ata8) to physical disks (sda).
-        
+
        Cross-references SMART health to avoid false positives from transient
        ATA controller errors. If SMART reports PASSED, dmesg errors are
        downgraded to INFO (transient).
        """
        # Reconcile any disk_<name> warnings persisted across a noisy
        # shutdown / reboot before the main scan starts. Without this
        # pass the main loop only resolves errors for devices that show
        # fresh events in the current dmesg window — devices that simply
        # disappeared from dmesg stay flagged indefinitely.
        self._reconcile_stale_disk_warnings()
        current_time = time.time()
        disk_results = {}  # Single dict for both WARNING and CRITICAL
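
The device-name fallback used by `_reconcile_stale_disk_warnings()` (structured `block_device` field, then `disk`, then the suffix of the error key) can be exercised on its own. The sketch below is a standalone illustration under assumptions: `recover_block_device` is a hypothetical helper name, and the handling of a `None` or non-dict `details` value is an extra defensive step that does not appear in the diff.

```python
import json
from typing import Any, Dict


def recover_block_device(err: Dict[str, Any]) -> str:
    """Best-effort block device name for a persisted disk_<name> error.

    Hypothetical helper mirroring the fallback order in the diff:
    details['block_device'] -> details['disk'] -> suffix of error_key.
    Returns an empty string when no name can be recovered.
    """
    err_key = err.get("error_key") or ""
    details = err.get("details") or {}

    # The persistence layer may hand back details as a JSON string.
    if isinstance(details, str):
        try:
            details = json.loads(details)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            details = {}
    # Assumption: valid JSON that is not an object is treated as
    # "no details" instead of being allowed to raise later.
    if not isinstance(details, dict):
        details = {}

    derived = err_key[len("disk_"):] if err_key.startswith("disk_") else ""
    return details.get("block_device") or details.get("disk") or derived
```

For example, a record with only `error_key = 'disk_nvme2n1'` yields `nvme2n1`, while a record whose details carry `block_device` uses that value instead of the key suffix.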