# swf-alarms

Standalone polling alarm engine for the swf ecosystem (PanDA, streaming
workflow). It has zero Django coupling: it pulls PanDA data via REST (through
swf-remote's proxy, which owns the SSH tunnel to pandaserver02), persists
state in swf-remote's Postgres (the same DB as the Django dashboard), and
sends email via AWS SES.

Full system overview: see `../docs/alarms.md`. This README is the
engine-developer entry point.

## Why standalone

- Runs on any host with network access to swf-remote: no Django
  bootstrap, no project PYTHONPATH, no management command.
- Lightweight deps (`httpx`, `boto3`, `psycopg`) mean a small, portable
  venv.
- The Django side only *reads* alarm state to render dashboards.

## Install

```bash
cd /home/admin/github/swf-remote/alarms
bash deploy/install.sh
```

This creates `.venv/` and copies `config.toml.example` to `config.toml` if
the latter is absent.

Edit `config.toml` (SES region, from address, DB DSN) before the first
live run.

## Run

Dry run (writes state, suppresses email):

```bash
.venv/bin/swf-alarms-run --config config.toml --dry-run -v
```

Live run:

```bash
.venv/bin/swf-alarms-run --config config.toml -v
```

## Schedule

See `deploy/crontab.example`. Every 5 minutes is the default cadence.

## Data source

The engine hits `https://epic-devcloud.org/prod/api/panda/tasks/`, which is
swf-remote's transparent proxy onto swf-monitor at BNL. Adding new
panels of PanDA data (queues, jobs, errors) requires only that (a)
swf-monitor exposes another REST endpoint and (b) swf-remote routes
it through the existing `panda_api_proxy` catch-all. No engine change is
required.

## Adding a new alarm

See `../docs/alarms.md` § "Adding a new alarm" for the full mechanism.
Summary:

1. Add `swf_alarms/alarms/<name>.py` exposing a `PARAMS` dict and a
   `detect(client, params)` function that yields `Detection(...)` objects.
2. Share math via `swf_alarms/lib/*`. There is no central registry.
3. Create an `Entry` row (`kind='alarm'`, `context='swf-alarms'`,
   `data.entry_id` matching the module name) via a data migration or the
   Django shell.
4. The next cron tick picks it up automatically.

The contract: `detect` must not email, must not raise on transient
fetch failures (log and yield nothing), and must set a stable
`dedupe_key` per entity so state-based dedup works.

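The contract above can be sketched as a minimal alarm module. This is an illustration under stated assumptions, not the engine's actual code: the `Detection` stand-in (only `dedupe_key` is named in this README), the `get_tasks()` client method, the task fields, and the `max_idle_hours` parameter are all hypothetical.

```python
"""Hypothetical module, e.g. swf_alarms/alarms/stalled_tasks.py."""
import logging
from dataclasses import dataclass

log = logging.getLogger("swf_alarms.alarms.stalled_tasks")


# Stand-in for the engine's Detection type. Only `dedupe_key` comes from
# the README; `summary` is an assumed field for illustration.
@dataclass
class Detection:
    dedupe_key: str
    summary: str


# Tunable parameters exposed by the module (the key name is hypothetical).
PARAMS = {"max_idle_hours": 6}


def detect(client, params):
    """Yield one Detection per task idle longer than the threshold."""
    try:
        tasks = client.get_tasks()  # hypothetical client method
    except Exception as exc:
        # Transient fetch failure: log and yield nothing, never raise.
        # A real module would narrow this to the client's error types.
        log.warning("task fetch failed: %s", exc)
        return
    for task in tasks:
        if task["idle_hours"] > params["max_idle_hours"]:
            yield Detection(
                # Stable per entity, so the same task maps to the same event.
                dedupe_key=f"task:{task['id']}",
                summary=f"task {task['id']} idle {task['idle_hours']}h",
            )


class StubClient:
    """Test double standing in for the engine's REST client."""

    def __init__(self, tasks=None, fail=False):
        self.tasks = tasks or []
        self.fail = fail

    def get_tasks(self):
        if self.fail:
            raise ConnectionError("tunnel down")
        return self.tasks
```

Because `detect` is a generator, the bare `return` in the failure branch simply ends iteration, so the caller sees an empty detection set rather than an exception.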
## Adding a new channel

Add `send_<channel>(alarm, **cfg) -> bool` in `notify.py`. Wire it into
`run.py` behind a `channels = [...]` config knob. Failures must return
`False` (not raise) so that one stuck channel can't cascade.

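As a rough sketch of that shape, the following shows a hypothetical webhook channel and a dispatcher. The `send_webhook` name, the `url` and `timeout` config keys, the assumption that `alarm` is a JSON-serializable dict, and the `notify_all` helper are all illustrative and not part of the existing `notify.py`.

```python
import json
import logging
import urllib.request

log = logging.getLogger("swf_alarms.notify")


def send_webhook(alarm, **cfg) -> bool:
    """Hypothetical channel: POST the alarm as JSON to cfg['url']."""
    try:
        req = urllib.request.Request(
            cfg["url"],
            data=json.dumps(alarm).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        timeout = cfg.get("timeout", 10)
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Any failure (bad config, network error, HTTP error) is logged
        # and reported as False so other channels still get their turn.
        log.exception("webhook channel failed")
        return False


def notify_all(alarm, channels, senders, cfg) -> dict:
    """Call each configured channel in order; return {channel: success}."""
    return {name: senders[name](alarm, **cfg.get(name, {})) for name in channels}
```

Keeping the `try` around the whole body means even a missing `url` key yields `False` instead of an exception escaping into the run loop.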
## "Disabled" (per-alarm) semantics

Each alarm's `data.enabled` flag controls **only the email side**. When
it is `False`:

- The algorithm still runs every tick.
- Event rows are still created, and the active/clear lifecycle still advances.
- The dashboard still shows everything.
- **No SES call is made.** `last_notified` is not updated.

When it is `True`, the engine additionally sends email on new detections and
on renotification. To stop the algorithm entirely, use `archived=True`, not
`enabled=False`. There is no global email switch; the per-alarm flag is the
only control.

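The difference between the two flags can be summarized in a few lines. This is only a sketch of the semantics described above; the `plan_tick` function and the keys of its result are hypothetical and do not mirror the engine's internals.

```python
def plan_tick(enabled: bool, archived: bool) -> dict:
    """Describe what happens for one alarm on one tick (illustrative)."""
    if archived:
        # archived=True stops the algorithm entirely.
        return {"run_algorithm": False, "write_events": False, "send_email": False}
    # enabled only gates the email side; detection and state always run.
    return {"run_algorithm": True, "write_events": True, "send_email": enabled}
```

For example, `plan_tick(enabled=False, archived=False)` still runs the algorithm and writes events, but sends no email.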
## Dedup and renotification

- **State-based dedup.** One active `event` row per `(alarm, entity)`.
  While active, the engine bumps `data.last_seen` without re-emailing.
- **Auto-clear.** On a successful tick where the entity is no longer
  in the detection set, the event's `data.clear_time` is set to now.
  A transient fetch failure does NOT auto-clear; the last-known state is
  preserved.
- **One email per alarm per tick.** Every detection that would warrant
  a send this tick (new events, plus events whose renotification
  window has elapsed, plus events created while emails were off) is
  bundled into a single SES email, rather than one email per task.
- **Renotification window.** Per-alarm `data.renotification_window_hours`.
  Governs when a still-firing event is eligible to be re-included in
  the next bundle. A value of 0, or a missing value, means one email per
  event lifecycle (the event is bundled once when new and never renotified
  until it clears and re-fires).

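The rules above can be sketched as one in-memory tick function. This is a simplified model under assumptions, not the engine's implementation: the real engine persists events as Postgres rows, and the function names, the dict-based event shape, and the use of `None` to signal a failed fetch are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def due_for_email(event, now, window_hours):
    """True if an active event belongs in this tick's bundle."""
    last = event["last_notified"]
    if last is None:
        # New event, or one created while email was disabled.
        return True
    if not window_hours:
        # 0 or missing: one email per event lifecycle.
        return False
    return now - last >= timedelta(hours=window_hours)


def process_tick(active, detected_keys, now, window_hours=0, enabled=True):
    """Apply one tick to `active` (dedupe_key -> event dict), in place.

    Returns (bundle, cleared): the keys to put in the single email for
    this alarm, and the events that auto-cleared on this tick.
    """
    if detected_keys is None:
        # Transient fetch failure: keep last-known state, clear nothing.
        return [], {}
    bundle = []
    for key in sorted(detected_keys):
        # One active event per entity; repeat detections reuse it.
        event = active.setdefault(key, {"last_notified": None, "clear_time": None})
        event["last_seen"] = now
        if enabled and due_for_email(event, now, window_hours):
            event["last_notified"] = now
            bundle.append(key)
    cleared = {}
    for key in sorted(set(active) - set(detected_keys)):
        event = active.pop(key)
        event["clear_time"] = now
        cleared[key] = event
    return bundle, cleared


# Usage: two tasks fire, then one recovers two hours later.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
state = {}
first_bundle, _ = process_tick(state, {"task:1", "task:2"}, t0, window_hours=2)
later_bundle, later_cleared = process_tick(
    state, {"task:1"}, t0 + timedelta(hours=2), window_hours=2
)
```

Leaving `last_notified` as `None` while email is disabled is what lets events created during that period join the first bundle after the alarm is enabled.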
## Dashboard

Served by swf-remote Django at `/prod/alarms/`. Reads from the same
Postgres `entry` table the engine writes. See
`../src/remote_app/alarm_views.py`.