# swf-alarms

Standalone polling alarm engine for the swf ecosystem (PanDA, streaming
workflow). Zero Django coupling: it pulls PanDA data via REST (through
swf-remote's proxy, which owns the SSH tunnel to pandaserver02), persists
state in swf-remote's Postgres (same DB as the Django dashboard), and sends
email via AWS SES.

Full system overview: see `../docs/alarms.md`. This README is the
engine-developer entry point.

## Why standalone

- Runs on any host with network access to swf-remote: no Django
  bootstrap, no project PYTHONPATH, no management command.
- Lightweight deps (`httpx`, `boto3`, `psycopg`) mean a small, portable
  venv.
- The Django side only *reads* alarm state to render dashboards.

## Install

```bash
cd /home/admin/github/swf-remote/alarms
bash deploy/install.sh
```

Creates `.venv/`, copies `config.toml.example` → `config.toml` if absent.

Edit `config.toml` (SES region, from address, DB DSN) before the first
live run.

## Run

Dry-run (writes state, suppresses email):

```bash
.venv/bin/swf-alarms-run --config config.toml --dry-run -v
```

For real:

```bash
.venv/bin/swf-alarms-run --config config.toml -v
```

## Schedule

See `deploy/crontab.example`. Every 5 minutes is the default cadence.

## Data source

The engine hits `https://epic-devcloud.org/prod/api/panda/tasks/`,
swf-remote's transparent proxy onto swf-monitor at BNL. Adding new
panels of PanDA data (queues, jobs, errors) takes two steps: (a)
swf-monitor exposes another REST endpoint and (b) swf-remote routes
it through the existing `panda_api_proxy` catch-all. No engine change
is required.

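A minimal sketch of why the engine needs no change: if every PanDA resource sits under the same proxy prefix, a new panel is only a new path segment. The helper name `panda_url` and the `queues` resource are hypothetical; only the `tasks/` URL comes from this README.

```python
from urllib.parse import urljoin

# Proxy prefix, derived from the tasks URL quoted above.
PANDA_API_BASE = "https://epic-devcloud.org/prod/api/panda/"


def panda_url(resource: str) -> str:
    """Return the proxied URL for a PanDA resource such as 'tasks'."""
    # Normalise to exactly one trailing slash, matching the tasks endpoint.
    return urljoin(PANDA_API_BASE, resource.strip("/") + "/")


print(panda_url("tasks"))   # the endpoint the engine uses today
print(panda_url("queues"))  # hypothetical endpoint, not confirmed to exist
```
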
## Adding a new alarm

See `../docs/alarms.md` § "Adding a new alarm" for the full mechanism.
Summary:

1. Drop `swf_alarms/alarms/<name>.py` exposing a `PARAMS` dict and
   `def detect(client, params)`, yielding `Detection(...)` objects.
2. Share math via `swf_alarms/lib/*`; there is no central registry.
3. Create an `Entry` row (kind='alarm', context='swf-alarms',
   data.entry_id matching the module name) via data migration or
   Django shell.
4. The next cron tick picks it up automatically.

The contract: `detect` must not email, must not raise on transient
fetch failures (log + yield nothing), and must set a stable
`dedupe_key` per entity so state-based dedup works.

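The contract above could look roughly like this in an alarm module. This is an illustrative sketch, not the engine's real code: the module name `stalled_tasks`, the `Detection` fields other than `dedupe_key`, the `max_idle_hours` parameter, the `client.get_tasks()` method, and the task dict keys are all assumptions.

```python
"""Hypothetical swf_alarms/alarms/stalled_tasks.py (illustrative only)."""
import logging
from dataclasses import dataclass
from typing import Any, Iterator

log = logging.getLogger(__name__)


@dataclass(frozen=True)
class Detection:
    """Stand-in for the engine's Detection; subject/body are assumed fields."""
    dedupe_key: str
    subject: str
    body: str


# Default parameters for this alarm; the key name is an assumption.
PARAMS = {"max_idle_hours": 12}


def detect(client: Any, params: dict) -> Iterator[Detection]:
    """Yield one Detection per task idle for longer than the threshold."""
    try:
        tasks = client.get_tasks()  # assumed client method
    except Exception as exc:
        # Transient fetch failure: log and yield nothing, never raise.
        # A real module would likely narrow this to transport errors.
        log.warning("task fetch failed: %s", exc)
        return
    limit = params.get("max_idle_hours", PARAMS["max_idle_hours"])
    for task in tasks:
        if task["idle_hours"] > limit:
            yield Detection(
                dedupe_key=f"task:{task['id']}",  # stable per entity
                subject=f"Task {task['id']} stalled",
                body=f"Idle for {task['idle_hours']} h (limit {limit} h).",
            )
```

Note that `detect` only reports; it never sends email itself, which leaves notification decisions to the engine.
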
## Adding a new channel

Add `send_<channel>(alarm, **cfg) -> bool` in `notify.py`. Wire it into
`run.py` behind a `channels = [...]` config knob. Failures must return
`False` (not raise) so one stuck channel can't cascade.

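As a sketch of the return-`False`-on-failure rule, here is a hypothetical channel that appends alarms to a local file, plus a small dispatcher. The names `send_logfile` and `notify_all`, the `path` config key, and the dict shape of `alarm` are assumptions made for illustration; the real wiring in `run.py` may differ.

```python
import json
import logging
from typing import Any, Callable, Dict

log = logging.getLogger(__name__)


def send_logfile(alarm: dict, **cfg: Any) -> bool:
    """Hypothetical channel: append the alarm as one JSON line to a file."""
    try:
        with open(cfg["path"], "a", encoding="utf-8") as fh:
            fh.write(json.dumps(alarm) + "\n")
        return True
    except (KeyError, OSError, TypeError, ValueError) as exc:
        # Report failure instead of raising so other channels still run.
        log.error("logfile channel failed: %s", exc)
        return False


def notify_all(
    alarm: dict,
    senders: Dict[str, Callable[..., bool]],
    cfg_by_channel: Dict[str, dict],
) -> Dict[str, bool]:
    """Call each sender in turn and collect its True/False result."""
    results = {}
    for name, send in senders.items():
        results[name] = send(alarm, **cfg_by_channel.get(name, {}))
    return results
```

Because every sender returns a bool, the dispatcher needs no `try`/`except` of its own, and a failing channel shows up as a `False` entry in the result.
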
## "Disabled" (per-alarm) semantics

Each alarm's `data.enabled` flag controls **only the email side**. When
False:

- The algorithm still runs every tick.
- Event rows are still created, and active/clear state is still updated.
- The dashboard still shows everything.
- **No SES call is made.** `last_notified` is not updated.

When True, the engine additionally sends email on new detections and on
renotification. "Stop the algorithm entirely" is `archived=True`, not
`enabled=False`. There is no global email switch; the per-alarm flag is
the only control.

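The flag semantics can be summarised as a small decision function. This is a sketch under assumptions: `TickPlan` and `plan_tick` are hypothetical names, and treating `archived=True` as "no event rows written" is an inference from "stop the algorithm entirely". The `dry_run` argument mirrors the `--dry-run` flag from the Run section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TickPlan:
    run_algorithm: bool  # call detect() this tick
    record_events: bool  # create/update event rows
    send_email: bool     # call SES and update last_notified


def plan_tick(enabled: bool, archived: bool, dry_run: bool = False) -> TickPlan:
    """Decide what one tick does for one alarm."""
    if archived:
        # Archived alarms are skipped entirely.
        return TickPlan(run_algorithm=False, record_events=False, send_email=False)
    # enabled only gates email; detection and state tracking always run.
    return TickPlan(
        run_algorithm=True,
        record_events=True,
        send_email=enabled and not dry_run,
    )
```
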
## Dedup and renotification

- **State-based dedup.** One active `event` row per `(alarm, entity)`.
  While active, the engine bumps `data.last_seen` without re-emailing.
- **Auto-clear.** On a successful tick where the entity is no longer
  in the detection set, the event's `data.clear_time` is set to now.
  A transient fetch failure does NOT auto-clear; last-known state is
  preserved.
- **One email per alarm per tick.** Every detection that would warrant
  a send this tick (new events, plus events whose renotification
  window has elapsed, plus events created while emails were off) is
  bundled into a single SES email. No more one-email-per-task.
- **Renotification window.** Per-alarm `data.renotification_window_hours`.
  Governs when a still-firing event is eligible to be re-included in
  the next bundle. 0 / missing = one email per event lifecycle (the
  event is bundled once when new, never renotified until it clears and
  re-fires).

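The rules above can be sketched as one in-memory reconciliation step per alarm per tick. This is not the engine's implementation: the real state lives in Postgres `event` rows, and the function names, the dict layout, and the choice of `>=` for "window has elapsed" are assumptions. Cleared events are simply dropped from the dict here, where the engine would set `data.clear_time`.

```python
from datetime import datetime, timedelta, timezone


def reconcile(active, detected, now, window_hours=0, fetch_ok=True):
    """Update `active` in place; return (keys to bundle, keys cleared).

    `active` maps dedupe_key -> {"last_seen": dt, "last_notified": dt or None}.
    `detected` is the set of dedupe_keys seen this tick.
    """
    if not fetch_ok:
        # Transient fetch failure: keep last-known state, clear nothing.
        return [], []
    bundle, cleared = [], []
    for key in sorted(detected):
        event = active.get(key)
        if event is None:
            # New event: start tracking it and include it in this tick's email.
            active[key] = {"last_seen": now, "last_notified": None}
            bundle.append(key)
            continue
        event["last_seen"] = now  # still firing: bump without re-emailing
        last = event["last_notified"]
        if last is None:
            bundle.append(key)  # created while emails were off
        elif window_hours > 0 and now - last >= timedelta(hours=window_hours):
            bundle.append(key)  # renotification window has elapsed
    for key in sorted(set(active) - set(detected)):
        del active[key]  # auto-clear: entity no longer detected
        cleared.append(key)
    return bundle, cleared


def mark_notified(active, keys, now):
    """Record a successful send; skip this when the alarm's email is off."""
    for key in keys:
        active[key]["last_notified"] = now


state = {}
tick = datetime(2024, 1, 1, tzinfo=timezone.utc)
bundle, cleared = reconcile(state, {"task:42"}, tick, window_hours=6)
mark_notified(state, bundle, tick)  # one email covering the whole bundle
```

Returning the bundle as a list, rather than sending inside the loop, is what makes "one email per alarm per tick" straightforward: the caller sends once, then calls `mark_notified` only if the send succeeded.
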
## Dashboard

Served by swf-remote Django at `/prod/alarms/`. Reads from the same
Postgres `entry` table the engine writes. See
`../src/remote_app/alarm_views.py`.
0119 `../src/remote_app/alarm_views.py`.