# swf-remote alarms

Always-on proactive alarm capability for the ePIC PanDA production. A
small standalone engine polls swf-monitor via swf-remote's loopback
proxy every five minutes, persists everything in swf-remote's Postgres
via a generic `Entry` table (tjai-style document store), and ships email
through AWS SES. The dashboard lives on the prod header menu, right of
PCS; the per-alarm editor is CodeMirror with autosave and version history.

## Vocabulary

We use **three** distinct terms. They are not synonyms.

| Term | Meaning | Scope |
|---|---|---|
| **Alarm** | One configured condition: a module, a row in the DB, and recipients. Fires **events** when matched. | System noun. Never "check". |
| **Renotification window** | On a still-firing alarm, how long to wait before re-emailing about the same entity. 0 = one email per lifecycle. | Per-alarm attribute. |
| **Since** (N hours / N days) | How far back to look. Two independent uses: the dashboard filter ("show events from the last N hours") and the alarm's data lookback ("analyse PanDA jobs from the last N days"). Same word, different referents. | One is dashboard state; the other is per-alarm `params.since_days`. |

## What "disabled" means (per-alarm)

Each alarm has a per-alarm `data.enabled` flag, surfaced in the editor
as **Emails ON/OFF** and on the dashboard as the **Emails** column.

**`enabled=True` (Emails ON):** the algorithm runs every tick, events
fire into the DB, active/clear state is updated, and emails are sent on
new detections (and on renotification when the window elapses).

**`enabled=False` (Emails OFF):** the algorithm **still runs every
tick**. Events **still** fire into the DB. Active/clear state is
**still** updated. The dashboard **still** shows everything. **Only
email delivery is suppressed.** No SES call is made. `last_notified` is
not touched. The alarm is "silent": monitoring stays operational, mail
stops.

This is the intended flow for tuning a new or noisy alarm: start with
Emails OFF, watch the dashboard, confirm the detections look right, then
flip Emails ON.

**Stopping an alarm entirely** (algorithm does not run at all) is
`archived=True`. That also hides the row from the live dashboard.
`archived` is separate from `enabled`.

There is **no global emails switch.** Per-alarm is the only control.
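
The two flags can be read as two independent gates. A minimal sketch, assuming the alarm row is available as a plain dict with `archived` and `data.enabled`; the helper names are hypothetical and are not taken from the engine:

```python
def algorithm_runs(alarm: dict) -> bool:
    """The detection algorithm runs unless the alarm is archived."""
    return not alarm.get("archived", False)


def emails_sent(alarm: dict) -> bool:
    """Email goes out only when the algorithm runs and Emails is ON."""
    enabled = alarm.get("data", {}).get("enabled", False)
    return algorithm_runs(alarm) and bool(enabled)


# A silent alarm: still detecting, no mail.
silent = {"archived": False, "data": {"enabled": False}}
# A stopped alarm: `enabled` is irrelevant once archived.
stopped = {"archived": True, "data": {"enabled": True}}
```

Here `silent` still detects but never mails, while `stopped` does neither, whatever its `enabled` value.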

## Why this shape

- **Standalone engine, not a Django management command.** See profile
  note `profile-standalone-over-django-mgmt-commands`: operational
  tools stay REST-fed, lightweight, and independent of one Django app's
  bootstrap.
- **One DB.** swf-remote already runs on Postgres; alarm state goes in
  the same DB. No sqlite, no second store.
- **Everything is an `Entry`.** The alarm config, each firing, each
  engine tick: all are rows in the same tjai-faithful `entry` table.
  Adding a new customization on swf-remote (next project, whatever it
  is) means reusing the same table with a new `kind` value. The `data`
  JSONField carries the per-kind metadata.
- **Snowflake per alarm: no registry, no "kinds".** Each alarm has its
  own Python module at `alarms/swf_alarms/alarms/<name>.py` exposing
  `detect(client, params)`. The engine dispatches by importing the
  module whose name matches the alarm's `entry_id`. If two alarms share
  code, they share it by importing the same helper out of
  `alarms/swf_alarms/lib/`, not by being entries in a central dispatch
  table.
- **State-based dedup (not cooldown timers).** One active event per
  (alarm, entity); while that event exists the engine bumps its
  `data.last_seen` without re-emailing (unless the per-alarm
  renotification window has elapsed). When the condition goes away, the
  engine sets `data.clear_time = now`. The next time it reappears, a new
  event (and a new email) fires.
- **Nav injection.** The alarm dashboard lives on the production header
  menu alongside PCS. swf-remote's own pages use a local base template;
  proxied swf-monitor pages (PanDA, PCS) get an `Alarms` link injected
  in `monitor_client.proxy()` the same way `nav-auth` is swapped.
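
Registry-free dispatch can be sketched with `importlib`. This is an illustration only: the helper names are hypothetical, and it assumes the module name is the `entry_id` with its `alarm_` prefix removed.

```python
import importlib
from types import ModuleType


def alarm_module_path(entry_id: str, package: str = "swf_alarms.alarms") -> str:
    """Map 'alarm_<name>' to the dotted path '<package>.<name>'."""
    prefix = "alarm_"
    if not entry_id.startswith(prefix):
        raise ValueError(f"not an alarm entry_id: {entry_id!r}")
    return f"{package}.{entry_id[len(prefix):]}"


def load_alarm_module(entry_id: str, package: str = "swf_alarms.alarms") -> ModuleType:
    """Import the module for one alarm; no central table is consulted."""
    return importlib.import_module(alarm_module_path(entry_id, package))
```

A caller would then use `load_alarm_module(entry_id).detect(client, params)`. The `package` argument exists only so the sketch can be exercised against any importable package.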

## Architecture

```
          ┌──────────────────────┐
          │ swf-alarms engine    │  (cron */5 min)
          │ alarms/swf_alarms/   │
          │ standalone venv      │
          └───┬──────────────┬───┘
              │ https        │ psycopg
              │ (loopback)   │
              ▼              ▼
┌─────────────────────┐   ┌──────────────────────────┐
│ swf-remote Django   │   │ Postgres (swf_remote)    │
│ /prod/api/panda/*   │──►│ entry, entry_context,    │
│ /prod/alarms/       │   │ entry_version            │
└──────────┬──────────┘   └──────────────────────────┘
           │ SSH tunnel
           ▼
┌─────────────────────┐   ┌──────────────────────────┐
│ swf-monitor (BNL)   │   │ AWS SES                  │
│ /api/panda/tasks/…  │   │ alarm emails             │
└─────────────────────┘   └──────────────────────────┘
```

## Entry conventions used by alarms

All rows live in context `swf-alarms` (except teams, which live in
`teams`). Rows are filtered out of live views when `archived=True`
(explicit boolean, separate from `status`).

| kind | data.entry_id | What it represents |
|---|---|---|
| `alarm` | `alarm_<name>` | One configured alarm. `content` is the description / email body. `data.params` holds thresholds etc. `data.recipients` routes emails. `data.enabled` gates **email delivery only**; the algorithm always runs. `data.renotification_window_hours` controls re-email. |
| `event` | `event_<name>` (NON-UNIQUE) | One firing instance. Many rows share the same `entry_id`. `data.fire_time` is set when the row is created; `data.clear_time` is null while active and set once cleared. `data.dedupe_key` identifies the entity (e.g. task id). `content` is the email body sent when this fired. |
| `engine_run` | `run_<unix_ts>` | One engine tick. `data` holds aggregate counters, per-alarm detail (`data.per_alarm`), and any error trace. |

Multiple event rows share `data.entry_id`, and that is deliberate. `entry_id`
identifies the alarm type; the Entry's UUID distinguishes instances.
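
To make the shared `entry_id` concrete, the sketch below indexes one alarm's active events by `dedupe_key`. `EventRow` is a hypothetical in-memory stand-in for an `Entry` row of kind `event`, not the real model:

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class EventRow:
    entry_id: str                             # shared by every firing of one alarm
    dedupe_key: str                           # the entity, e.g. a task id
    fire_time: float
    clear_time: Optional[float] = None        # None means still active
    id: UUID = field(default_factory=uuid4)   # distinguishes instances


def active_by_key(rows, entry_id):
    """Index the active (uncleared) events of one alarm by dedupe_key."""
    return {r.dedupe_key: r for r in rows
            if r.entry_id == entry_id and r.clear_time is None}


rows = [
    EventRow("event_demo", "task-1", fire_time=100.0, clear_time=400.0),
    EventRow("event_demo", "task-1", fire_time=700.0),   # same entity, new lifecycle
    EventRow("event_demo", "task-2", fire_time=700.0),
    EventRow("event_other", "task-1", fire_time=700.0),
]
active = active_by_key(rows, "event_demo")
```

Both `task-1` rows carry the same `entry_id` and `dedupe_key`; only the uncleared one appears in `active`.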

## Alarm config `data` shape

Top-level keys on `data` are engine-universal (same for every alarm):

- `entry_id`: `alarm_<name>`, where `<name>` matches the module filename.
- `enabled`: boolean. Per-alarm **email switch**. When False the
  algorithm still runs and events still fire; only email delivery is
  suppressed. See "What 'disabled' means" above.
- `recipients`: string or list; emails and/or `@team` references.
- `renotification_window_hours`: float; 0 means one email per lifecycle.
- `params`: nested dict; **per-alarm** keys consumed by that alarm's
  `detect()`. The alarm module declares its `PARAMS` surface (see
  "Adding a new alarm" below).
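
The `recipients` value mixes plain addresses with `@team` references. A minimal sketch of flattening it at send time, assuming teams are available as a dict of team name to member emails and that a string value is comma-separated (both are assumptions of this sketch, and `expand_recipients` is a hypothetical name):

```python
def expand_recipients(recipients, teams):
    """Flatten `recipients` into a de-duplicated list of email addresses.

    `recipients` may be a string or a list; items starting with '@' are
    team references looked up in `teams`. Unknown teams expand to nothing.
    """
    if isinstance(recipients, str):
        recipients = recipients.split(",")  # assumed separator
    out = []
    for item in recipients:
        item = item.strip()
        if not item:
            continue
        members = teams.get(item[1:], []) if item.startswith("@") else [item]
        for addr in members:
            if addr not in out:  # keep first occurrence, preserve order
                out.append(addr)
    return out


teams = {"prodops": ["bob@example.com", "alice@example.com"]}
to = expand_recipients(["@prodops", "alice@example.com"], teams)
```

Order is preserved and an address reached both directly and through a team is listed once.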

## Engine loop (per tick)

1. Load `kind='alarm'` entries where `archived=False` **regardless of
   `data.enabled`**. The algorithm always runs; `enabled` only controls
   the email side.
2. For each alarm config:
   a. Fetch current active events (`clear_time` null) for this alarm.
   b. Import `swf_alarms.alarms.<name>` and call its `detect(client, params)`.
   c. For each detection:
      - `dedupe_key` in active-events map → bump `last_seen`. If this
        alarm's emails are on AND (the event has never been notified
        OR the renotification window has elapsed since `last_notified`),
        add it to this alarm's **renotify bundle**.
      - Otherwise → create a new `kind='event'` row (`fire_time=now`,
        `clear_time=null`), store a single-detection body on it (the
        event-detail page reads from this), and add it to this alarm's
        **new bundle**.
   d. For each previously-active event whose `dedupe_key` is NOT in
      this tick's detections (and the alarm didn't error), set
      `data.clear_time = now`. Auto-clear happens regardless of `enabled`.
3. Still per alarm: if its emails are on AND the combined bundle (new +
   renotify) is non-empty, ship **one SES email** covering all new +
   renotifying detections. On success, stamp `last_notified = now` on
   every event included in the bundle. `notifications_sent` in the
   engine-run counters increments by one per bundle, regardless of how
   many detections the bundle carried.
4. Close out the `engine_run` entry with counters + `data.per_alarm`
   (which includes `bundle_new`, `bundle_renotify`, `bundle_sent`).

**One email per alarm per tick**, never one per detection. When a tick
trips N tasks, you receive a single email listing all N, not N emails.

A transient fetch failure on one alarm does NOT auto-clear that alarm's
active events; the last known state is preserved until the next
successful tick.
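
The active/clear bookkeeping for a single alarm can be sketched in memory as follows. This is not the engine's code: `Event` and `tick` are hypothetical names, the email send is reduced to stamping `last_notified`, and a window of 0 is interpreted as "never renotify once an email has gone out". A caller would skip `tick` entirely when the fetch failed, so nothing is auto-cleared on error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    dedupe_key: str
    fire_time: float
    last_seen: float
    last_notified: Optional[float] = None
    clear_time: Optional[float] = None


def tick(active, detected_keys, now, emails_on, window_hours):
    """Apply one tick to one alarm; return (new, renotify, cleared)."""
    new, renotify, cleared = [], [], []
    for key in detected_keys:
        event = active.get(key)
        if event is None:
            # First sighting of this entity: open a new event.
            event = Event(key, fire_time=now, last_seen=now)
            active[key] = event
            new.append(event)
            continue
        event.last_seen = now
        due = event.last_notified is None or (
            window_hours > 0
            and now - event.last_notified >= window_hours * 3600
        )
        if emails_on and due:
            renotify.append(event)
    # Auto-clear runs whether or not emails are on.
    for key in list(active):
        if key not in detected_keys:
            event = active.pop(key)
            event.clear_time = now
            cleared.append(event)
    if emails_on and (new or renotify):
        # One email would cover the whole bundle; stamp every event in it.
        for event in new + renotify:
            event.last_notified = now
    return new, renotify, cleared
```

Because an event that fired while emails were off has `last_notified` unset, it lands in the renotify bundle on the first tick after emails are switched on.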

## Dashboard

At `/prod/alarms/`. Parts:

1. **Engine health banner**: ok / warn / bad / unknown, from the last
   `engine_run` finished time and error count. Shows seconds until the
   next */5 boundary.
2. **Teams**: reusable recipient aliases. `@<teamname>` references
   expand to member emails at send time. Editor is its own page.
3. **Summary table**: one row per alarm config: name (link to section),
   Emails (`enabled`), alarms-since-N-hours (N user-settable via the
   `Since` filter, default 24), currently-active count, last-fired
   time. A yellow **quiet** badge appears next to alarm names that saw
   zero detections in the last few runs despite prior history, a
   heuristic for silently-broken alarms.
4. **Per-alarm section** (one per active alarm config):
   - Header: name, `[Edit]` button.
   - Metadata table: entry_id, created/modified, recipients, params.
   - Body/description card.
   - Events-since-N-hours table (reverse chron): fire, clear, state,
     dedupe key, subject (link to event detail).
5. **Recent engine runs table**: counters per run, per-alarm
   breakdown, errors highlighted.
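
The banner's countdown to the next */5 boundary is simple clock arithmetic. A minimal sketch (the function name is hypothetical; the dashboard may compute this differently, for example in the browser):

```python
from datetime import datetime, timezone

TICK_SECONDS = 5 * 60  # cron cadence: */5 minutes


def seconds_until_next_tick(now: datetime) -> int:
    """Whole seconds from `now` to the next */5-minute wall-clock boundary."""
    elapsed = (now.minute * 60 + now.second) % TICK_SECONDS
    return TICK_SECONDS - elapsed


remaining = seconds_until_next_tick(
    datetime(2024, 1, 1, 12, 3, 20, tzinfo=timezone.utc)
)  # 100 seconds left until 12:05:00
```

Exactly on a boundary the function returns 300, that is, the time to the following tick.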

## Editor: `/prod/alarms/<entry_id>/edit/`

CodeMirror 5 (markdown mode, material-darker theme) on the alarm's
`content` (description / email body). JSON-mode CodeMirror on
`params`. First-class form fields for enabled, recipients,
renotification window.

Features:

- **PARAMS help panel**: the alarm module declares a `PARAMS` dict
  (name → type / required / default / description); the editor renders
  it as a table above the JSON box so you can see what keys this
  specific alarm actually reads.
- **[Test (live, no email)]**: runs the alarm's `detect()` once with
  the current in-editor params against live data, shows all detections
  in-page. Never emails. Uses the editor's unsaved values so you can
  try before saving.
- **[Preview email body]**: composes the email body (description +
  a synthetic detection context) so you can see what a real notification
  would look like.
- **Autosave** every 10s via POST (JSON body). Also on Ctrl/Cmd-S, and
  on `beforeunload` via `navigator.sendBeacon`.
- **localStorage backup** on every keystroke. If the browser crashes
  or the server is unreachable, the backup is visible as a "local" row
  in the version-history table with a `[Restore]` button.
- **Version history table**: server-side versions (rendered inline on
  page load) with click-to-load. The server creates an `EntryVersion`
  row automatically via the `pre_save` signal whenever content or
  substantive `data` changes (noise keys like `last_seen` are filtered
  out so autosave doesn't spam version rows).

All server-side edits go through `alarm_views.alarm_config_save`; the
Entry's `pre_save` signal handles versioning transparently.
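
The noise-key filter behind versioning can be sketched as a comparison that ignores bookkeeping keys. The key set and function names below are assumptions for illustration; the real signal handler may filter a longer list:

```python
NOISE_KEYS = {"last_seen"}  # assumption: the real list may be longer


def substantive(data: dict) -> dict:
    """Return `data` without bookkeeping keys that should not trigger a version."""
    return {k: v for k, v in data.items() if k not in NOISE_KEYS}


def needs_version(old_content, old_data, new_content, new_data) -> bool:
    """True when content or substantive data differs between two saves."""
    return (old_content != new_content
            or substantive(old_data) != substantive(new_data))


before = {"enabled": True, "last_seen": 100}
after = {"enabled": True, "last_seen": 400}
```

With these values, `needs_version("body", before, "body", after)` is False, so an autosave that only bumps `last_seen` would not add a version row.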

## Nav "Alarms" link

Right of PCS, on every production-mode page:

- **swf-remote native pages** (alarm dashboard, editor, event detail):
  `src/templates/base.html` has the link in the header nav directly.
- **Proxied swf-monitor pages** (PanDA, PCS, hubs):
  `monitor_client.proxy()` injects the link inside the
  `<span class="nav-mode nav-production">…</span>` block, using the
  same mechanism that swaps `nav-auth`.
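
Injection into proxied HTML can be sketched as a single regex substitution. This is not the code in `monitor_client.proxy()`: the link markup and function name are hypothetical, and the pattern assumes the production nav block contains no nested `<span>`.

```python
import re

ALARMS_LINK = '<a href="/prod/alarms/">Alarms</a>'  # hypothetical markup

# Matches the production nav block up to its first closing </span>.
_NAV_BLOCK = re.compile(
    r'(<span class="nav-mode nav-production">.*?)(</span>)', re.DOTALL
)


def inject_alarms_link(html: str) -> str:
    """Append the Alarms link inside the production nav block, at most once."""
    if ALARMS_LINK in html:
        return html  # already injected
    return _NAV_BLOCK.sub(
        lambda m: m.group(1) + ALARMS_LINK + m.group(2), html, count=1
    )


page = '<span class="nav-mode nav-production"><a href="/pcs/">PCS</a></span>'
patched = inject_alarms_link(page)
```

Pages without the production nav block pass through unchanged, and running the function twice does not duplicate the link.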

## Adding a new alarm

There is no "new alarm" button in the UI: alarms are algorithms over
data, not configuration-only records. Adding one is a code + DB + cron
operation by a developer. The mechanism, end to end:

1. **Write the module.** Create
   `alarms/swf_alarms/alarms/<name>.py` exposing:

   ```python
   from ..lib import Detection

   PARAMS = {
       "threshold": {"type": float, "required": True,
                     "description": "fire when X exceeds this"},
       "since_days": {"type": int, "default": 1,
                      "description": "look back this many days"},
   }

   def detect(client, params):
       # ... query data via `client`, yield Detection(...) per entity ...
       yield Detection(
           dedupe_key="…",    # stable per-entity
           subject="…",       # email subject + dashboard row
           body_context="…",  # appended to the alarm's description
           extra_data={},     # structured context for the event row
       )
   ```

   The contract: `detect` must not email, must not raise on transient
   failures (log and yield nothing), and must set a stable `dedupe_key`
   per entity so state-based dedup works.

2. **Share helpers, not dispatch.** If the algorithm is similar to an
   existing one, import a helper from `alarms/swf_alarms/lib/`. Do
   **not** add a central registry entry; there is no registry.

3. **Create the DB config.** Add an `Entry` row via a data migration
   (preferred: reproducible) or the Django shell. Schema:

   ```python
   # Import path assumed from `src/remote_app/models.py`.
   from remote_app.models import Entry, EntryContext

   Entry(
       kind='alarm',
       # Assumes the context row is looked up by `name`.
       context=EntryContext.objects.get(name='swf-alarms'),
       title="Human-readable title",
       content="Description that prefixes the email body…",
       data={
           'entry_id': 'alarm_<name>',   # must match module
           'enabled': True,
           'recipients': ['@prodops', 'alice@example.com'],
           'renotification_window_hours': 24,
           'params': {},                 # fill with keys from the module's PARAMS
       },
       status='active',
       archived=False,
   )
   ```

4. **Pick it up on the next tick.** The engine runs every 5 minutes
   via cron (`alarms/deploy/crontab.example`). Because cron starts a
   fresh process for each tick, new modules are picked up automatically
   by the next tick: no engine restart required, no redeploy required.
   If you want to run it immediately:

   ```bash
   /home/admin/github/swf-remote/alarms/.venv/bin/swf-alarms-run \
       --config /home/admin/github/swf-remote/alarms/config.toml --dry-run -v
   ```

   (Drop `--dry-run` to send real emails.)

5. **Django side picks up the PARAMS help on deploy.** The editor
   imports the alarm module to render its PARAMS help panel, so as
   soon as the dev tree is deployed via `deploy/update_from_dev.sh`,
   the editor shows the new alarm's param surface.

Retiring an alarm: set `enabled=False` to silence its emails (detection
continues and history stays visible), or `archived=True` to stop it and
hide it from the dashboard. The module file can stay; it is inert until
an Entry references it.
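
For a fuller picture of the module contract from step 1, here is a self-contained sketch of a failure-rate alarm. Everything beyond `PARAMS`, `detect`, and the `Detection` field names is hypothetical: the local `Detection` class stands in for `swf_alarms.lib.Detection`, `client.tasks(...)` is an invented call, and the task dict shape is made up for the example.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("swf_alarms.example")


@dataclass
class Detection:  # stand-in for swf_alarms.lib.Detection
    dedupe_key: str
    subject: str
    body_context: str = ""
    extra_data: dict = field(default_factory=dict)


PARAMS = {
    "threshold": {"type": float, "required": True,
                  "description": "fire when the failure rate exceeds this"},
    "since_days": {"type": int, "default": 1,
                   "description": "look back this many days"},
}


def resolve_params(params):
    """Fill defaults from PARAMS and reject missing required keys."""
    resolved = {}
    for name, spec in PARAMS.items():
        if name in params:
            resolved[name] = spec["type"](params[name])
        elif spec.get("required"):
            raise ValueError(f"missing required param: {name}")
        else:
            resolved[name] = spec.get("default")
    return resolved


def detect(client, params):
    """Yield one Detection per task whose failure rate exceeds the threshold."""
    cfg = resolve_params(params)
    try:
        tasks = client.tasks(since_days=cfg["since_days"])  # hypothetical call
    except OSError as exc:
        # Transient fetch failure: log and yield nothing.
        log.warning("fetch failed: %s", exc)
        return
    for task in tasks:
        rate = task["failed"] / max(task["total"], 1)
        if rate > cfg["threshold"]:
            yield Detection(
                dedupe_key=str(task["id"]),  # stable per entity
                subject=f"task {task['id']} failure rate {rate:.0%}",
                extra_data={"rate": rate},
            )


class FakeClient:
    """Canned data so the sketch runs without swf-monitor."""
    def tasks(self, since_days):
        return [{"id": 1, "failed": 8, "total": 10},
                {"id": 2, "failed": 1, "total": 10}]


class DownClient:
    """Simulates an unreachable upstream."""
    def tasks(self, since_days):
        raise ConnectionError("swf-monitor unreachable")


hits = list(detect(FakeClient(), {"threshold": 0.5}))
```

Only task 1 crosses the 0.5 threshold, so `hits` holds a single detection keyed `"1"`. With `DownClient` the generator logs a warning and yields nothing.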

## Adding a new channel

Add `send_<channel>(alarm, **cfg) -> bool` in `alarms/swf_alarms/notify.py`.
Wire it into `run.py` behind a per-alarm or global `channels` config knob.
Failures must return False (never raise) so one stuck channel doesn't
cascade.
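
A channel function honouring the "return False, never raise" rule might look like the sketch below. It posts JSON to a generic webhook using only the standard library; `send_webhook`, the payload shape, the `subject`/`body` keys on `alarm`, and the injectable `opener` are all assumptions made for illustration.

```python
import json
import logging
import urllib.request

log = logging.getLogger("swf_alarms.notify")


def send_webhook(alarm, *, url, timeout=10.0, opener=urllib.request.urlopen):
    """Post the alarm subject/body as JSON; return True on a 2xx reply.

    Every failure is logged and reported as False, never raised.
    """
    try:
        payload = json.dumps({"text": f"{alarm['subject']}\n{alarm['body']}"})
        request = urllib.request.Request(
            url,
            data=payload.encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except Exception as exc:  # broad on purpose: one stuck channel must not cascade
        log.warning("webhook send failed: %s", exc)
        return False
```

The `opener` parameter exists so the function can be exercised without a network; in `run.py` the caller would only look at the boolean result.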

## Files (where the code lives)

**Django side** (`src/remote_app/`):

| File | Purpose |
|---|---|
| `models.py` | `EntryContext`, `Entry`, `EntryVersion`. tjai-faithful fields; `archived` boolean; pinned `db_table` names. |
| `migrations/0001_initial.py` | Schema. |
| `migrations/0002_seed_alarms.py` | Seeds `swf-alarms` context + initial alarm configs. |
| `migrations/0003_seed_teams.py` | Seeds the `teams` context + `@prodops`; adds `renotification_window_hours` to existing alarms. |
| `migrations/0005_drop_alarm_kind.py` | Drops legacy `data.kind` from alarm rows (pre-snowflake residue). |
| `migrations/0006_rename_days_window.py` | Renames `data.params.days_window` → `since_days` on existing alarm rows. |
| `signals.py` | `pre_save` snapshot on Entry. |
| `alarms_data.py` | ORM query helpers. Functions named `events_since`, `count_events_since`, `quiet_alarms`. |
| `alarm_views.py` | `alarms_dashboard`, `alarm_event_detail`, `alarm_config_edit/save/version`, `alarm_test`, `alarm_preview`, team views. Reads alarm modules' `PARAMS` for the editor help panel. |
| `views.py` | Re-exports alarm views. |
| `urls.py` | Alarm routes. |
| `templates/monitor_app/alarms.html` | Dashboard. |
| `templates/monitor_app/alarm_config_edit.html` | Editor. |
| `templates/monitor_app/alarm_event_detail.html` | Single firing detail. |
| `templates/monitor_app/team_edit.html` | Team editor. |
| `templatetags/swf_fmt.py` | `fmt_dt` and `state_class`. |
| `monitor_client.py` | Alarms-link injection in proxied HTML. |
| `src/templates/base.html` | Prod-style header nav + Alarms link. |

**Engine side** (`alarms/`):

| File | Purpose |
|---|---|
| `swf_alarms/config.py` | TOML loader: engine-level settings + DB DSN. |
| `swf_alarms/db.py` | psycopg layer over `entry`/`entry_context`/`entry_version`. Helpers: `list_alarm_configs`, `active_events_for_alarm`, `create_event`, `touch_event_last_seen`, `clear_event`, `start/finish_engine_run`. |
| `swf_alarms/fetch.py` | HTTP client for the swf-monitor REST API. |
| `swf_alarms/lib/__init__.py` | `Detection` dataclass, the value type alarm modules yield. |
| `swf_alarms/lib/failure_rate.py` | Shared PanDA-task failure-rate helper + its `PARAMS`. |
| `swf_alarms/alarms/<name>.py` | One snowflake alarm module per configured alarm. Currently: `panda_failure_rate_sakib`, `panda_failure_rate_eic_all`. |
| `swf_alarms/run.py` | Engine entry point. Loads configs, drives active/clear semantics, writes events + engine_run entries. Supports `--dry-run`. |
| `swf_alarms/notify.py` | SES send. Channel failures log but never raise. |
| `deploy/install.sh` | venv + log dir. Schema is owned by swf-remote migrations. |
| `deploy/crontab.example` | */5 min cadence. |
| `config.toml.example` | Engine, DB, email only. |
| `pyproject.toml` | `httpx`, `boto3`, `psycopg[binary]`. |

## Future work

- Mattermost channel (SES-parallel).
- Per-task-owner routing (look up task username → recipient at
  event-create time; the config recipients list becomes a fallback).
- Acknowledgement: ack button on active events to suppress notification
  without waiting for auto-clear.
- Time-bucket charts per alarm (events/hour over last N days).