# swf-remote alarms

Always-on proactive alarm capability for the ePIC PanDA production. A
small standalone engine polls swf-monitor via swf-remote's loopback
proxy every five minutes, persists everything in swf-remote's Postgres
via a generic `Entry` table (tjai-style document-store), and ships email
through AWS SES. The dashboard lives on the prod header menu, right of
PCS; the per-alarm editor is CodeMirror with autosave and version history.

## Vocabulary

We use **three** distinct terms. They are not synonyms.

| Term | Meaning | Scope |
|------|---------|-------|
| **Alarm** | One configured condition: a module + a row in the DB + recipients. Fires **events** when matched. | System noun. Never "check". |
| **Renotification window** | On a still-firing alarm, how long to wait before re-emailing the same entity. 0 = one email per lifecycle. | Per-alarm attribute. |
| **Since** (N hours / N days) | How far back to look. Two independent uses: the dashboard filter ("show events from the last N hours"), and the alarm's data lookback ("analyse PanDA jobs from the last N days"). Same word, different referents. | One is dashboard state; the other is per-alarm `params.since_days`. |

## What "disabled" means (per-alarm)

Each alarm has its own `data.enabled` flag, surfaced in the editor
as **Emails ON/OFF** and on the dashboard as the **Emails** column.

**`enabled=True` (Emails ON):** the algorithm runs every tick, events
fire into the DB, active/clear state is updated, and emails are sent on
new detections (and on renotification when the window elapses).

**`enabled=False` (Emails OFF):** the algorithm **still runs every
tick**. Events **still** fire into the DB. Active/clear state is
**still** updated. The dashboard **still** shows everything. **Only
email delivery is suppressed.** No SES call is made. `last_notified` is
not touched. The alarm is "silent": monitoring stays operational, mail
stops.

This is the intended flow for tuning a new or noisy alarm: start with
Emails OFF, watch the dashboard, confirm the detections look right,
then flip Emails ON.

**Stopping an alarm entirely** (algorithm does not run at all) is
`archived=True`. That also hides the row from the live dashboard.
`archived` is separate from `enabled`.

There is **no global emails switch.** Per-alarm is the only control.

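The two flags can be read as two independent predicates. The sketch below is illustrative only: `AlarmConfig`, `should_run`, and `should_email` are hypothetical names, not taken from the engine.

```python
from dataclasses import dataclass


@dataclass
class AlarmConfig:
    """Hypothetical stand-in for an alarm Entry row."""
    entry_id: str
    enabled: bool = True      # data.enabled: email switch only
    archived: bool = False    # Entry.archived: stops the alarm entirely


def should_run(alarm: AlarmConfig) -> bool:
    """The algorithm runs for every non-archived alarm."""
    return not alarm.archived


def should_email(alarm: AlarmConfig) -> bool:
    """Email requires a running alarm with Emails ON."""
    return should_run(alarm) and alarm.enabled
```

With `enabled=False` and `archived=False`, `should_run` is true and `should_email` is false, which is the "silent" state described above.
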
## Why this shape

- **Standalone engine, not a Django management command.** See profile
  note `profile-standalone-over-django-mgmt-commands`: operational
  tools stay REST-fed, lightweight, and independent of one Django app's
  bootstrap.
- **One DB.** swf-remote already runs on Postgres; alarm state goes in
  the same DB. No sqlite, no second store.
- **Everything is an `Entry`.** The alarm config, each firing, each
  engine tick: all rows in the same tjai-faithful `entry` table.
  Adding a new customization on swf-remote (next project, whatever it
  is) = reuse the same table with a new `kind` value. The `data`
  JSONField carries the per-kind metadata.
- **Snowflake per alarm: no registry, no "kinds".** Each alarm has its
  own Python module at `alarms/swf_alarms/alarms/<name>.py` exposing
  `detect(client, params)`. The engine dispatches by importing the
  module named by the alarm's `entry_id` (`alarm_<name>` maps to
  `<name>.py`). If two alarms share code, they share it by importing
  the same helper out of `alarms/swf_alarms/lib/`, not by being entries
  in a central dispatch table.
- **State-based dedup (not cooldown timers).** One active event per
  (alarm, entity); while that event exists the engine bumps its
  `data.last_seen` without re-emailing (unless the per-alarm
  renotification window has elapsed). When the condition goes away, the
  engine sets `data.clear_time = now`. Next time it re-appears, a new
  event (and a new email) fires.
- **Nav injection.** The alarm dashboard lives on the production header
  menu alongside PCS. swf-remote's own pages use a local base template;
  proxied swf-monitor pages (PanDA, PCS) get an `Alarms` link injected
  in `monitor_client.proxy()` the same way `nav-auth` is swapped.

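Registry-free dispatch can be sketched with `importlib`. This is a minimal illustration, assuming the module name is the `entry_id` with its `alarm_` prefix removed; `module_name_for` and `load_detect` are hypothetical helper names, not the engine's own.

```python
import importlib

ALARM_PREFIX = "alarm_"


def module_name_for(entry_id: str) -> str:
    """Map 'alarm_<name>' to '<name>' (prefix handling is an assumption)."""
    if not entry_id.startswith(ALARM_PREFIX):
        raise ValueError(f"not an alarm entry_id: {entry_id!r}")
    return entry_id[len(ALARM_PREFIX):]


def load_detect(entry_id: str, package: str = "swf_alarms.alarms"):
    """Import the alarm's own module and return its detect() callable."""
    module = importlib.import_module(f"{package}.{module_name_for(entry_id)}")
    return module.detect
```

Because the lookup is just an import, adding an alarm means adding a file; nothing else has to be edited for the engine to find it.
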
## Architecture

```
  ┌──────────────────────┐
  │  swf-alarms engine   │  (cron */5 min)
  │  alarms/swf_alarms/  │
  │  standalone venv     │
  └───┬──────────────┬───┘
      │ https        │ psycopg
      │ (loopback)   │
      ▼              ▼
  ┌─────────────────────┐   ┌──────────────────────────┐
  │  swf-remote Django  │   │  Postgres (swf_remote)   │
  │  /prod/api/panda/*  │──►│  entry, entry_context,   │
  │  /prod/alarms/      │   │  entry_version           │
  └──────────┬──────────┘   └──────────────────────────┘
             │ SSH tunnel
             ▼
  ┌─────────────────────┐   ┌──────────────────────────┐
  │  swf-monitor (BNL)  │   │  AWS SES                 │
  │  /api/panda/tasks/… │   │  alarm emails            │
  └─────────────────────┘   └──────────────────────────┘
```

## Entry conventions used by alarms

All rows live in context `swf-alarms` (except teams, which live in
`teams`). Rows are filtered out of live views when `archived=True`
(explicit boolean, separate from `status`).

| kind | data.entry_id | What it represents |
|------|---------------|--------------------|
| `alarm` | `alarm_<name>` | One configured alarm. `content` is the description / email body. `data.params` holds thresholds etc. `data.recipients` routes emails. `data.enabled` gates **email delivery only**; the algorithm always runs. `data.renotification_window_hours` controls re-email. |
| `event` | `event_<name>` (NON-UNIQUE) | One firing instance. Many rows share the same `entry_id`. `data.fire_time` set when created, `data.clear_time` null=active, set=cleared. `data.dedupe_key` identifies the entity (e.g. task id). `content` is the email body sent when this fired. |
| `engine_run` | `run_<unix_ts>` | One engine tick. `data` holds aggregate counters, `data.per_alarm` carries per-alarm detail, any error trace. |

Multiple event rows share `data.entry_id`; that's deliberate. `entry_id`
identifies the alarm type; the Entry's UUID distinguishes instances.

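To make the non-unique `entry_id` convention concrete, here is a small in-memory sketch. The dict layout and the helpers `make_event` and `active_by_dedupe_key` are illustrative assumptions, not the real ORM or psycopg layer.

```python
import uuid
from typing import Optional


def make_event(name: str, dedupe_key: str, fire_time: str,
               clear_time: Optional[str] = None) -> dict:
    """Build a hypothetical in-memory stand-in for a kind='event' row."""
    return {
        "id": uuid.uuid4(),   # the Entry's UUID distinguishes instances
        "kind": "event",
        "data": {
            "entry_id": f"event_{name}",   # shared by every firing
            "dedupe_key": dedupe_key,
            "fire_time": fire_time,
            "clear_time": clear_time,      # None means still active
        },
    }


def active_by_dedupe_key(rows: list, name: str) -> dict:
    """Index the active events of one alarm by the entity they concern."""
    return {
        r["data"]["dedupe_key"]: r
        for r in rows
        if r["kind"] == "event"
        and r["data"]["entry_id"] == f"event_{name}"
        and r["data"]["clear_time"] is None
    }
```

Two firings of the same alarm carry the same `entry_id` but different UUIDs, and only rows with a null `clear_time` count as active.
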
## Alarm config `data` shape

Top-level keys on `data` are engine-universal (same for every alarm):

- `entry_id`: `alarm_<name>`, where `<name>` matches the module filename.
- `enabled`: boolean. Per-alarm **email switch**. When False the
  algorithm still runs and events still fire; only email delivery is
  suppressed. See "What 'disabled' means" above.
- `recipients`: string or list; emails and/or `@team` references.
- `renotification_window_hours`: float; 0 means one email per lifecycle.
- `params`: nested dict; **per-alarm** keys consumed by that alarm's
  `detect()`. The alarm module declares its PARAMS surface (see below).

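Since `recipients` may be a string or a list and may mix addresses with `@team` references, a normalisation step is needed before sending. The sketch below is one possible shape: `expand_recipients` is a hypothetical name, the `teams` mapping stands in for the rows in the `teams` context, and the comma/whitespace delimiter for the string form is an assumption.

```python
def expand_recipients(recipients, teams):
    """Flatten `data.recipients` into a de-duplicated list of addresses.

    `recipients` may be a string or a list; how a string is delimited is
    an assumption here (commas and/or whitespace).
    """
    if isinstance(recipients, str):
        recipients = recipients.replace(",", " ").split()
    out = []
    for item in recipients:
        item = item.strip()
        # '@team' expands to its members; unknown teams expand to nothing.
        members = teams.get(item[1:], []) if item.startswith("@") else [item]
        for addr in members:
            if addr and addr not in out:
                out.append(addr)
    return out
```

Order is preserved and duplicates are dropped, so a person who is both listed directly and a member of a referenced team appears once.
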
## Engine loop (per tick)

1. Load `kind='alarm'` entries where `archived=False` **regardless of
   `data.enabled`**. The algorithm always runs; `enabled` only controls
   the email side.
2. For each alarm config:
   a. Fetch current active events (clear_time null) for this alarm.
   b. Import `swf_alarms.alarms.<name>` and call its `detect(client, params)`.
   c. For each detection:
      - `dedupe_key` in active-events map → bump `last_seen`. If this
        alarm's emails are on AND (the event has never been notified
        OR the renotification window has elapsed since `last_notified`),
        add it to this alarm's **renotify bundle**.
      - Otherwise → create a new `kind='event'` row (fire_time=now,
        clear_time=null), store a single-detection body on it (the
        event-detail page reads from this), and add it to this alarm's
        **new bundle**.
   d. For each previously-active event whose `dedupe_key` is NOT in
      this tick's detections (and the alarm didn't error), set
      `data.clear_time = now`. This auto-clear happens regardless of
      `enabled`.
3. If this alarm's emails are on AND the bundle is non-empty: ship **one
   SES email** covering all new + renotifying detections. On success,
   stamp `last_notified = now` on every event included in the bundle.
   `notifications_sent` in the engine-run counters increments by one
   per bundle, regardless of how many detections the bundle carried.
4. Close out the `engine_run` entry with counters + `data.per_alarm`
   (which includes `bundle_new`, `bundle_renotify`, `bundle_sent`).

**One email per alarm per tick**, never one-per-detection. When a tick
tripped N tasks, you receive a single email listing all N, not N
emails.

Transient fetch failure on one alarm does NOT auto-clear that alarm's
active events; the last known state is preserved until the next
successful tick.

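Step 2 of the loop is essentially a three-way split of dedupe keys. The sketch below shows that split on plain dicts; `reconcile` and its argument names are hypothetical, and the real engine works on DB rows rather than in-memory state.

```python
from datetime import datetime, timedelta, timezone


def reconcile(active, detected_keys, now, emails_on, window_hours):
    """Split one alarm's tick into new / renotify / cleared dedupe keys.

    `active` maps dedupe_key -> {"last_seen", "last_notified"} for events
    whose clear_time is null; it is updated in place for seen events.
    """
    seen = set(detected_keys)
    new, renotify = [], []
    for key in detected_keys:
        event = active.get(key)
        if event is None:
            new.append(key)              # caller creates the event row
            continue
        event["last_seen"] = now         # still firing: bump last_seen
        last = event.get("last_notified")
        window_elapsed = (
            last is not None
            and window_hours > 0         # 0 means one email per lifecycle
            and now - last >= timedelta(hours=window_hours)
        )
        if emails_on and (last is None or window_elapsed):
            renotify.append(key)
    cleared = [k for k in active if k not in seen]
    return new, renotify, cleared
```

A caller following the loop above would skip the `cleared` list entirely when the alarm errored during the tick, so that a transient fetch failure does not clear active events.
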
## Dashboard

At `/prod/alarms/`. Parts:

1. **Engine health banner**: ok / warn / bad / unknown, from last
   `engine_run` finished time and error count. Shows seconds until the
   next */5 boundary.
2. **Teams**: reusable recipient aliases. `@<teamname>` references
   expand to member emails at send time. Editor is its own page.
3. **Summary table**: one row per alarm config, showing name (link to
   section), Emails (ON/OFF), events-since-N-hours (N user-settable via
   the `Since` filter, default 24), currently-active count, last-fired
   time. A yellow **quiet** badge appears next to alarm names that saw
   zero detections in the last few runs despite prior history, a
   heuristic for silently-broken alarms.
4. **Per-alarm section** (one per active alarm config):
   - Header: name, `[Edit]` button.
   - Metadata table: entry_id, created/modified, recipients, params.
   - Body/description card.
   - Events-since-N-hours table (reverse chron): fire, clear, state,
     dedupe key, subject (link to event detail).
5. **Recent engine runs table**: counters per run, per-alarm
   breakdown, errors highlighted.

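The banner's countdown to the next */5 boundary is simple modular arithmetic. A minimal sketch, assuming the cron cadence is aligned to the wall clock; `seconds_until_next_tick` is a hypothetical name:

```python
from datetime import datetime, timezone

TICK_SECONDS = 5 * 60  # cron */5 cadence


def seconds_until_next_tick(now: datetime) -> int:
    """Seconds from `now` until the next multiple-of-five-minute boundary."""
    into_hour = now.minute * 60 + now.second
    return TICK_SECONDS - (into_hour % TICK_SECONDS)
```

Exactly on a boundary this returns a full 300 seconds, i.e. it counts down to the following tick rather than the one that has just started.
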
## Editor: `/prod/alarms/<entry_id>/edit/`

CodeMirror 5 (markdown mode, material-darker theme) on the alarm's
`content` (description / email body). JSON-mode CodeMirror on
`params`. First-class form fields for enabled, recipients,
renotification window.

Features:

- **PARAMS help panel**: the alarm module declares a `PARAMS` dict
  (name → type / required / default / description); the editor renders
  it as a table above the JSON box so you can see what keys this
  specific alarm actually reads.
- **[Test (live, no email)]**: runs the alarm's `detect()` once with
  the current in-editor params against live data, shows all detections
  in-page. Never emails. Uses the editor's unsaved values so you can
  try before saving.
- **[Preview email body]**: composes the email body (description +
  a synthetic detection context) so you can see what a real notification
  would look like.
- **Autosave** every 10s via POST (JSON body). Also on Ctrl/Cmd-S, and
  on `beforeunload` via `navigator.sendBeacon`.
- **localStorage backup** on every keystroke. If the browser crashes
  or the server is unreachable, the backup is visible as a "local" row
  in the version-history table with a `[Restore]` button.
- **Version history table**: server-side versions (rendered inline on
  page load) with click-to-load. The server creates an `EntryVersion`
  row automatically via the `pre_save` signal whenever content or
  substantive `data` changes (noise keys like `last_seen` are filtered
  out so autosave doesn't spam version rows).

All server-side edits go through `alarm_views.alarm_config_save`; the
Entry's pre_save signal handles versioning transparently.

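The "substantive change" test behind versioning can be sketched as a comparison that ignores noise keys. This is an illustration, not the code in `signals.py`: `NOISE_KEYS`, `substantive`, and `needs_version` are hypothetical names, and the real noise-key list may contain more than `last_seen`.

```python
NOISE_KEYS = {"last_seen"}  # assumption: the real list may be longer


def substantive(data: dict) -> dict:
    """Drop noise keys so bookkeeping updates do not count as edits."""
    return {k: v for k, v in data.items() if k not in NOISE_KEYS}


def needs_version(old_content, old_data, new_content, new_data) -> bool:
    """True when content or substantive data differs from the saved row."""
    return (old_content != new_content
            or substantive(old_data) != substantive(new_data))
```

With this shape, an autosave that only touches `last_seen` produces no version row, while flipping `enabled` or editing the description does.
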
## Nav "Alarms" link

Right of PCS, on every production-mode page:

- **swf-remote native pages** (alarm dashboard, editor, event detail):
  `src/templates/base.html` has the link in the header nav directly.
- **Proxied swf-monitor pages** (PanDA, PCS, hubs):
  `monitor_client.proxy()` injects the link inside the
  `<span class="nav-mode nav-production">…</span>` block, the same
  mechanism that swaps `nav-auth`.

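Link injection into proxied HTML can be approximated with a single regex substitution. The sketch below is not the code in `monitor_client.proxy()`: `inject_alarms_link` and the exact link markup are assumptions, and it assumes the production nav span contains no nested `<span>`.

```python
import re

ALARMS_LINK = '<a href="/prod/alarms/">Alarms</a>'  # hypothetical markup
NAV_SPAN = re.compile(
    r'(<span class="nav-mode nav-production">)(.*?)(</span>)', re.DOTALL)


def inject_alarms_link(html: str) -> str:
    """Append the Alarms link inside the production nav span, once."""
    if ALARMS_LINK in html:
        return html  # already injected: keep the operation idempotent
    return NAV_SPAN.sub(
        lambda m: m.group(1) + m.group(2) + " " + ALARMS_LINK + m.group(3),
        html, count=1)
```

Pages without the production nav span pass through unchanged, so the proxy can apply the function to every HTML response.
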
## Adding a new alarm

There is no "new alarm" button in the UI: alarms are algorithms over
data, not configuration-only records. Adding one is a code + DB + cron
operation by a developer. The mechanism, end to end:

1. **Write the module.** Create
   `alarms/swf_alarms/alarms/<name>.py` exposing:

   ```python
   from ..lib import Detection

   PARAMS = {
       "threshold": {"type": float, "required": True,
                     "description": "fire when X exceeds this"},
       "since_days": {"type": int, "default": 1,
                      "description": "look back this many days"},
   }

   def detect(client, params):
       # ... query data via `client`, yield Detection(...) per entity ...
       yield Detection(
           dedupe_key="…",    # stable per-entity
           subject="…",       # email subject + dashboard row
           body_context="…",  # appended to the alarm's description
           extra_data={},     # structured context for the event row
       )
   ```

   The contract: `detect` must not email, must not raise on transient
   failures (log and yield nothing), and must set a stable `dedupe_key`
   per entity so state-based dedup works.

2. **Share helpers, not dispatch.** If the algorithm is similar to an
   existing one, import a helper from `alarms/swf_alarms/lib/`. Do
   **not** add a central registry entry; there is no registry.

3. **Create the DB config.** Add an `Entry` row via a data migration
   (preferred: reproducible) or Django shell. Schema:

   ```python
   Entry(
       kind='alarm',
       context=swf_alarms_context,  # placeholder: the EntryContext named 'swf-alarms'
       title="Human-readable title",
       content="Description that prefixes the email body…",
       data={
           'entry_id': 'alarm_<name>',          # must match module
           'enabled': True,
           'recipients': ['@prodops', 'alice@example.com'],
           'renotification_window_hours': 24,
           'params': {...},                     # keys from PARAMS
       },
       status='active',
       archived=False,
   )
   ```

4. **Pick it up on the next tick.** The engine runs every 5 minutes
   via cron (`alarms/deploy/crontab.example`). New modules are picked
   up automatically by the next tick: no engine restart required, no
   redeploy required. If you want to run it immediately:

   ```bash
   /home/admin/github/swf-remote/alarms/.venv/bin/swf-alarms-run \
     --config /home/admin/github/swf-remote/alarms/config.toml --dry-run -v
   ```

   (Drop `--dry-run` to send real emails.)

5. **Django side picks up the PARAMS help on deploy.** The editor
   imports the alarm module to render its PARAMS help panel, so as
   soon as the dev tree is deployed via `deploy/update_from_dev.sh`,
   the editor shows the new alarm's param surface.

Removing an alarm: set `enabled=False` (silences its emails; the
algorithm still runs and history stays visible), or `archived=True`
(stops the algorithm and hides the alarm from the live dashboard). The
module file can stay: it's harmless code until referenced by an Entry.

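A module's `PARAMS` declaration carries enough information to resolve the stored `data.params` before calling `detect()`. The sketch below shows one way that could work; `resolve_params` is a hypothetical helper, and whether the engine coerces types or applies defaults this way is an assumption.

```python
PARAMS = {
    "threshold": {"type": float, "required": True,
                  "description": "fire when X exceeds this"},
    "since_days": {"type": int, "default": 1,
                   "description": "look back this many days"},
}


def resolve_params(spec: dict, raw: dict) -> dict:
    """Apply defaults, enforce required keys, and coerce declared types."""
    resolved = {}
    for name, meta in spec.items():
        if name in raw:
            resolved[name] = meta["type"](raw[name])
        elif meta.get("required"):
            raise ValueError(f"missing required param: {name}")
        elif "default" in meta:
            resolved[name] = meta["default"]
    return resolved
```

Keys present in the stored params but absent from `PARAMS` are dropped here, which keeps `detect()` from depending on undeclared settings.
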
## Adding a new channel

Add `send_<channel>(alarm, **cfg) -> bool` in `alarms/swf_alarms/notify.py`.
Wire it into `run.py` behind a per-alarm or global `channels` config knob.
Failures must return False (never raise) so one stuck channel doesn't
cascade.

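A channel function that honours the "return False, never raise" rule might look like the sketch below. Everything specific in it is hypothetical: the `send_mattermost` name, the injected `post` transport, and the `webhook_url` config key are assumptions chosen so the example runs without network access.

```python
import logging

log = logging.getLogger("swf_alarms.notify")


def send_mattermost(alarm: dict, *, post=None, **cfg) -> bool:
    """Hypothetical channel: deliver one alarm bundle, never raise.

    `post` is an injected callable standing in for the real transport;
    its name and signature are assumptions made for this sketch.
    """
    if post is None:
        log.warning("mattermost: no transport configured")
        return False
    try:
        post(cfg.get("webhook_url"), {"text": alarm.get("subject", "")})
        return True
    except Exception:  # a stuck channel must not cascade
        log.exception("mattermost: send failed for %s", alarm.get("entry_id"))
        return False
```

The broad `except Exception` is deliberate here: the caller only needs a boolean, and the traceback still reaches the log.
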
## Files (where the code lives)

**Django side** (`src/remote_app/`):

| File | Purpose |
|---|---|
| `models.py` | `EntryContext`, `Entry`, `EntryVersion`. tjai-faithful fields; `archived` boolean; pinned `db_table` names. |
| `migrations/0001_initial.py` | Schema. |
| `migrations/0002_seed_alarms.py` | Seeds `swf-alarms` context + initial alarm configs. |
| `migrations/0003_seed_teams.py` | Seeds the `teams` context + `@prodops`; adds `renotification_window_hours` to existing alarms. |
| `migrations/0005_drop_alarm_kind.py` | Drops legacy `data.kind` from alarm rows (pre-snowflake residue). |
| `migrations/0006_rename_days_window.py` | Renames `data.params.days_window` → `since_days` on existing alarm rows. |
| `signals.py` | `pre_save` snapshot on Entry. |
| `alarms_data.py` | ORM query helpers. Functions named `events_since`, `count_events_since`, `quiet_alarms`. |
| `alarm_views.py` | `alarms_dashboard`, `alarm_event_detail`, `alarm_config_edit/save/version`, `alarm_test`, `alarm_preview`, team views. Reads alarm modules' `PARAMS` for the editor help panel. |
| `views.py` | Re-exports alarm views. |
| `urls.py` | Alarm routes. |
| `templates/monitor_app/alarms.html` | Dashboard. |
| `templates/monitor_app/alarm_config_edit.html` | Editor. |
| `templates/monitor_app/alarm_event_detail.html` | Single firing detail. |
| `templates/monitor_app/team_edit.html` | Team editor. |
| `templatetags/swf_fmt.py` | `fmt_dt` and `state_class`. |
| `monitor_client.py` | Alarms-link injection in proxied HTML. |
| `src/templates/base.html` | Prod-style header nav + Alarms link. |

**Engine side** (`alarms/`):

| File | Purpose |
|---|---|
| `swf_alarms/config.py` | TOML loader: engine-level settings + DB DSN. |
| `swf_alarms/db.py` | psycopg layer over `entry`/`entry_context`/`entry_version`. Helpers: `list_alarm_configs`, `active_events_for_alarm`, `create_event`, `touch_event_last_seen`, `clear_event`, `start/finish_engine_run`. |
| `swf_alarms/fetch.py` | HTTP client for the swf-monitor REST. |
| `swf_alarms/lib/__init__.py` | `Detection` dataclass: the value-type alarm modules yield. |
| `swf_alarms/lib/failure_rate.py` | Shared PanDA-task failure-rate helper + its `PARAMS`. |
| `swf_alarms/alarms/<name>.py` | One snowflake alarm module per configured alarm. Currently: `panda_failure_rate_sakib`, `panda_failure_rate_eic_all`. |
| `swf_alarms/run.py` | Engine entry point. Loads configs, drives active/clear semantics, writes events + engine_run entries. Supports `--dry-run`. |
| `swf_alarms/notify.py` | SES send. Channel failures log but never raise. |
| `deploy/install.sh` | venv + log dir. Schema is owned by swf-remote migrations. |
| `deploy/crontab.example` | */5 min cadence. |
| `config.toml.example` | Engine, DB, email only. |
| `pyproject.toml` | `httpx`, `boto3`, `psycopg[binary]`. |

## Future work

- Mattermost channel (SES-parallel).
- Per-task-owner routing (lookup task username → recipient at
  event-create time; config recipients list becomes a fallback).
- Acknowledgement: ack button on active events to suppress notify
  without waiting for auto-clear.
- Time-bucket charts per alarm (events/hour over last N days).