# swf-alarms

Standalone polling alarm engine for the swf ecosystem (PanDA, streaming
workflow). Zero Django coupling: pulls PanDA data via swf-monitor REST,
persists state in swf-monitor's Postgres, sends email via AWS SES.

Full system overview: see `../docs/alarms.md`. This README is the
engine-developer entry point.

## Why standalone

- Runs on the monitor host without Django bootstrap, project PYTHONPATH,
  or a management command.
- Lightweight deps (`httpx`, `boto3`, `psycopg`) mean a small, portable
  venv.
- The Django side renders dashboards and writes alarm/team configuration
  edits; the standalone engine writes event and engine-run state.

## Install

```bash
cd /opt/swf-monitor/current/alarms
bash deploy/install.sh
```

The script creates `/opt/swf-monitor/shared/alarms-venv` and copies
`config.toml.example` to `/opt/swf-monitor/config/alarms/config.toml`
if that file is absent.

Edit `config.toml` (SES region, from address, DB DSN) before the first
live run.

## Run

Dry run (writes state, suppresses email):

```bash
/opt/swf-monitor/shared/alarms-venv/bin/swf-alarms-run --config /opt/swf-monitor/config/alarms/config.toml --dry-run -v
```

Live run (sends email):

```bash
/opt/swf-monitor/shared/alarms-venv/bin/swf-alarms-run --config /opt/swf-monitor/config/alarms/config.toml -v
```

## Schedule

See `deploy/crontab.example`. Every 5 minutes is the default cadence.

## Data source

The engine hits swf-monitor's `/api/panda/*` endpoints using the
`engine.service_base_url` from `config.toml`. Adding new panels of
PanDA data (queues, jobs, errors) only requires swf-monitor to expose
another REST endpoint. No engine topology change is required.

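As a rough illustration of why a new data panel needs no engine change, the sketch below joins a configured base URL with an `/api/panda/<resource>` path. The helper name `panda_endpoint`, the example host, and the resource names are assumptions made for this example, not the engine's actual code; whether the real endpoints take a trailing slash is also not stated here.

```python
from urllib.parse import urljoin


def panda_endpoint(service_base_url: str, resource: str) -> str:
    """Join the configured base URL with an /api/panda/<resource> path."""
    # Normalise slashes so urljoin appends to the base path instead of
    # replacing its last segment.
    base = service_base_url.rstrip("/") + "/"
    return urljoin(base, "api/panda/" + resource.strip("/"))


# Hypothetical base URL and resource names, for illustration only.
base_url = "https://monitor.example.org/swf-monitor"
for resource in ("queues", "jobs", "errors"):
    print(panda_endpoint(base_url, resource))
```

Under these assumptions, a new panel is just another `resource` string once swf-monitor serves it.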
## Adding a new alarm

See `../docs/alarms.md` "Adding a new alarm" for the full mechanism.
Summary:

1. Drop `swf_alarms/alarms/<name>.py` exposing a `PARAMS` dict and
   `def detect(client, params)`, yielding `Detection(...)` objects.
2. Share math via `swf_alarms/common/*`; there is no central registry.
3. Create an `Entry` row (`kind='alarm'`, `context='swf-alarms'`,
   `data.entry_id` matching the module name) via data migration or
   Django shell.
4. The next cron tick picks it up automatically.

The contract: `detect` must not email, must not raise on transient
fetch failures (log + yield nothing), and must set a stable
`dedupe_key` per entity so state-based dedup works.

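A minimal sketch of an alarm module that follows this contract is shown below. Only `PARAMS`, `detect(client, params)`, `Detection`, and `dedupe_key` come from the text above; the `Detection` fields other than `dedupe_key`, the `FetchError` exception, the `get_json` client method, the queue payload shape, and the threshold are stand-ins invented for the example.

```python
import logging
from dataclasses import dataclass, field
from typing import Iterator

log = logging.getLogger("swf_alarms.alarms.queue_backlog")


@dataclass
class Detection:
    """Stand-in for the engine's Detection type; real fields may differ."""
    dedupe_key: str
    summary: str
    data: dict = field(default_factory=dict)


class FetchError(Exception):
    """Stand-in for whatever the real client raises on a transient failure."""


# Tunables for this alarm (hypothetical name and default).
PARAMS = {"max_queued": 1000}


def detect(client, params) -> Iterator[Detection]:
    """Yield one Detection per queue whose backlog exceeds the threshold."""
    try:
        queues = client.get_json("queues")  # hypothetical client method
    except FetchError as exc:
        # Contract: log and yield nothing on a transient fetch failure.
        log.warning("queue fetch failed: %s", exc)
        return
    for q in queues:
        if q["queued"] > params["max_queued"]:
            yield Detection(
                dedupe_key=f"queue:{q['name']}",  # stable per entity
                summary=f"{q['name']} has {q['queued']} queued jobs",
                data={"queued": q["queued"]},
            )


class FakeClient:
    """Test double that returns canned data or simulates a failed fetch."""

    def __init__(self, payload=None, fail=False):
        self.payload, self.fail = payload or [], fail

    def get_json(self, resource):
        if self.fail:
            raise FetchError("timeout")
        return self.payload


ok = FakeClient([{"name": "BNL", "queued": 5000}, {"name": "JLAB", "queued": 10}])
found = list(detect(ok, PARAMS))
failed = list(detect(FakeClient(fail=True), PARAMS))
print([d.dedupe_key for d in found], failed)
```

Note that `detect` never sends anything itself; it only reports what it saw and leaves notification to the engine.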
## Adding a new channel

Add `send_<channel>(alarm, **cfg) -> bool` in `notify.py`. Wire into
`run.py` behind a `channels = [...]` config knob. Failures must return
`False` (not raise) so one stuck channel can't cascade.

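The sketch below shows one way a channel could honour the return-`False` rule, together with a dispatch loop that keeps going after a failure. The `send_logfile` channel, the `dispatch` helper, the alarm dict fields, and the per-channel config layout are all hypothetical; only the `send_<channel>(alarm, **cfg) -> bool` shape is taken from the text above.

```python
import logging
import os
import tempfile

log = logging.getLogger("swf_alarms.notify")


def send_logfile(alarm: dict, **cfg) -> bool:
    """Hypothetical channel: append a one-line summary to cfg['path']."""
    try:
        with open(cfg["path"], "a", encoding="utf-8") as fh:
            fh.write(f"{alarm['name']}: {alarm['summary']}\n")
        return True
    except (KeyError, OSError) as exc:
        # Contract: report failure by returning False, never by raising.
        log.warning("logfile channel failed: %s", exc)
        return False


def dispatch(alarm: dict, channels: list, senders: dict, cfg: dict) -> dict:
    """Call every configured channel; one failure does not stop the rest."""
    return {name: senders[name](alarm, **cfg.get(name, {})) for name in channels}


tmpdir = tempfile.mkdtemp()
senders = {"broken": send_logfile, "logfile": send_logfile}
cfg = {
    # The parent directory does not exist, so this channel fails to open.
    "broken": {"path": os.path.join(tmpdir, "missing", "alarms.log")},
    "logfile": {"path": os.path.join(tmpdir, "alarms.log")},
}
alarm = {"name": "queue_backlog", "summary": "BNL has 5000 queued jobs"}
results = dispatch(alarm, ["broken", "logfile"], senders, cfg)
print(results)
```

Because the failing channel returns `False` instead of raising, the second channel still runs in the same pass.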
## "Disabled" (per-alarm) semantics

Each alarm's `data.enabled` flag controls **only the email side**. When
`False`:

- The algorithm still runs every tick.
- Event rows are still created, and active/clear still ticks.
- The dashboard still shows everything.
- **No SES call is made.** `last_notified` is not updated.

When `True`, the engine additionally sends email on new detections and on
renotification. "Stop the algorithm entirely" is `archived=True`, not
`enabled=False`. There is no global email switch; per-alarm is the
only control.

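The two flags can be summarised as a small decision table. The sketch below is illustrative only: `TickPlan` and `plan_tick` are hypothetical names, and the flags are passed in as plain booleans rather than read from an `Entry` row.

```python
from dataclasses import dataclass


@dataclass
class TickPlan:
    run_algorithm: bool  # call detect() and record event state this tick
    send_email: bool     # allow an SES call for this alarm


def plan_tick(enabled: bool, archived: bool) -> TickPlan:
    """Map the per-alarm flags to what the engine does on one tick."""
    if archived:
        # Archived alarms are not evaluated at all.
        return TickPlan(run_algorithm=False, send_email=False)
    # Otherwise the algorithm always runs; `enabled` gates email only.
    return TickPlan(run_algorithm=True, send_email=enabled)


for enabled, archived in [(True, False), (False, False), (True, True)]:
    print(enabled, archived, plan_tick(enabled, archived))
```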
## Dedup and renotification

- **State-based dedup.** One active `event` row per `(alarm, entity)`.
  While active, the engine bumps `data.last_seen` without re-emailing.
- **Auto-clear.** On a successful tick where the entity is no longer
  in the detection set, the event's `data.clear_time` is set to now.
  A transient fetch failure does NOT auto-clear; last-known state is
  preserved.
- **One email per alarm per tick.** Every detection that would warrant
  a send this tick (new events, plus events whose renotification
  window has elapsed, plus events created while emails were off) is
  bundled into a single SES email. No more one-email-per-task.
- **Renotification window.** Per-alarm `data.renotification_window_hours`.
  Governs when a still-firing event is eligible to be re-included in
  the next bundle. `0` or missing = one email per event lifecycle (the
  event is bundled once when new, never renotified until it clears and
  re-fires).

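The rules above can be sketched as two small functions: one that updates active events after a tick, and one that decides whether an event belongs in the tick's email bundle. This is a simplified model under stated assumptions, not the engine's code: events are plain dicts keyed by `dedupe_key`, and the function names `apply_tick` and `due_for_bundle` are invented for the example. The field names `last_seen`, `clear_time`, and `last_notified` follow the text.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def due_for_bundle(last_notified: Optional[datetime],
                   window_hours: Optional[float], now: datetime) -> bool:
    """Decide whether an active event belongs in this tick's email bundle."""
    if last_notified is None:
        # Never emailed: a new event, or one created while email was off.
        return True
    if not window_hours:
        # 0 or missing window: one email per event lifecycle.
        return False
    return now - last_notified >= timedelta(hours=window_hours)


def apply_tick(active: dict, detected: set, fetch_ok: bool, now: datetime) -> None:
    """Update active events in place: bump last_seen or set clear_time."""
    if not fetch_ok:
        # Transient failure: keep last-known state, do not auto-clear.
        return
    for key, event in active.items():
        if key in detected:
            event["last_seen"] = now
        elif event.get("clear_time") is None:
            event["clear_time"] = now


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
active = {
    "task:1": {"last_notified": None},                      # not yet emailed
    "task:2": {"last_notified": now - timedelta(hours=1)},  # no longer detected
    "task:3": {"last_notified": now - timedelta(hours=7)},  # window elapsed
}
apply_tick(active, {"task:1", "task:3"}, fetch_ok=True, now=now)
bundle = sorted(
    key for key, ev in active.items()
    if ev.get("clear_time") is None and due_for_bundle(ev["last_notified"], 6, now)
)
print(bundle)
```

Everything in `bundle` would go into a single email for the alarm, after which `last_notified` would be updated for those events.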
## Dashboard

Served by swf-monitor Django at `/swf-monitor/alarms/`. Reads from the same
Postgres `entry` table the engine writes. See
`../src/monitor_app/alarm_views.py`.