# swf-monitor alarms

Alarms are now owned by `swf-monitor` on `pandaserver02`. The old
`swf-remote` alarm code and database remain available for rollback/reference,
but the live alarm dashboard, editor, runtime state, and cron runner are in
this repository.

## Runtime

- Dashboard: `/swf-monitor/alarms/`
- External face: `/prod/alarms/` via `swf-remote`, which proxies to the monitor
- Engine code: `alarms/swf_alarms/`
- Engine install: `/opt/swf-monitor/shared/alarms-venv`
- Engine config: `/opt/swf-monitor/config/alarms/config.toml`
- Engine logs: `/opt/swf-monitor/shared/logs/swf-alarms/`
- Cadence: every 5 minutes via cron

The alarm engine is standalone and does not boot Django. It reads alarm
configuration and writes event and run state directly through psycopg to the
`entry`, `entry_context`, and `entry_version` tables in the monitor Postgres
database.

## Email

The alarm engine sends mail through AWS SES using `boto3`. `notify.py` is the
send hook, intentionally isolated so that the delivery channel can be replaced
with a BNL-supported SMTP relay or mail API without changing alarm detection.
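
One way to keep the delivery channel swappable is to have the detection side depend only on a small send interface. The sketch below illustrates that idea and is not the contents of `notify.py`: `Notifier`, `RecordingNotifier`, `send_alarm_email`, and the subject format are all hypothetical. An SES backend would call `boto3` behind the same `send` method, and an SMTP backend would do likewise.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Notifier(Protocol):
    """Hypothetical send interface; not the real notify.py API."""

    def send(self, recipients: list[str], subject: str, body: str) -> None: ...


@dataclass
class RecordingNotifier:
    """Stand-in backend that records messages instead of sending them."""

    sent: list[tuple[list[str], str, str]] = field(default_factory=list)

    def send(self, recipients: list[str], subject: str, body: str) -> None:
        self.sent.append((recipients, subject, body))


def send_alarm_email(notifier: Notifier, alarm_name: str,
                     recipients: list[str], lines: list[str]) -> None:
    """Bundle detection lines into one message and hand it to the backend."""
    # The subject format is invented for this example.
    subject = f"[swf-alarms] {alarm_name}: {len(lines)} detection(s)"
    notifier.send(recipients, subject, "\n".join(lines))
```

Because `send_alarm_email` only sees the `Notifier` shape, replacing SES with another channel would mean supplying a different backend object rather than touching detection code.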

## Data Model

Rows use a tjai-style generic entry model.

| Context | Kind | Meaning |
|---|---|---|
| `swf-alarms` | `alarm` | One configured alarm. `data.entry_id` names the Python module. `data.enabled` gates email only. |
| `swf-alarms` | `event` | One firing instance. `data.clear_time` null means active. |
| `swf-alarms` | `engine_run` | One engine tick, with aggregate and per-alarm counters. |
| `teams` | `team` | Recipient aliases such as `@prodops`. |

The state imported from `swf-remote` at cutover contained 2 contexts, 17,554
entries, and 38 versions: 2 alarm configs, 67 events, 17,484 engine runs, and
1 team.
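
The two `data` flags in the table can be read as simple predicates. The following is a minimal sketch under stated assumptions: the `Entry` dataclass here is an in-memory stand-in and not the real table schema, and treating a missing `enabled` flag as disabled is a choice made for this example only.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Entry:
    """Illustrative stand-in for one generic entry row (not the real schema)."""

    context: str
    kind: str
    data: dict[str, Any] = field(default_factory=dict)


def is_active_event(entry: Entry) -> bool:
    """An event is active while data.clear_time is null (None here)."""
    return entry.kind == "event" and entry.data.get("clear_time") is None


def emails_enabled(entry: Entry) -> bool:
    """data.enabled gates email only.

    Assumption for this sketch: a missing flag counts as disabled.
    """
    return entry.kind == "alarm" and bool(entry.data.get("enabled", False))
```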

## Detection Flow

1. Load all non-archived `kind='alarm'` entries.
2. For each alarm, import `swf_alarms.alarms.<name>`, with the module named by
   `data.entry_id`.
3. Call the module's `detect(client, params)`.
4. Create or update event rows using stable `dedupe_key` values.
5. Clear events that are no longer detected on a successful tick.
6. If email is enabled for that alarm and the run is not `--dry-run`, bundle
   new/renotified detections into one email for that alarm.

`data.enabled=False` means "silent": detection still runs, event rows still
update, and the dashboard remains truthful. To stop an alarm algorithm entirely,
archive the alarm row.
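
Steps 4 to 6 amount to reconciling the set of currently detected `dedupe_key` values against the set of active events. The sketch below shows that reconciliation on in-memory state only. It is an illustration, not the engine's code: `reconcile`, `should_email`, and the `fire_time` field are hypothetical, renotification is not modelled, and the real engine persists rows through psycopg instead of mutating a dict.

```python
from datetime import datetime


def reconcile(active: dict[str, dict], detected_keys: set[str],
              now: datetime) -> tuple[list[str], list[str]]:
    """Apply one successful tick to in-memory event state.

    `active` maps dedupe_key -> event data for events with no clear_time.
    Returns (newly created keys, cleared keys), both sorted.
    """
    created = sorted(k for k in detected_keys if k not in active)
    cleared = sorted(k for k in active if k not in detected_keys)
    for key in created:
        # `fire_time` is an invented field name for this example.
        active[key] = {"fire_time": now, "clear_time": None}
    for key in cleared:
        # A cleared event leaves the active set; a real engine would persist it.
        event = active.pop(key)
        event["clear_time"] = now
    return created, cleared


def should_email(enabled: bool, dry_run: bool, new_keys: list[str]) -> bool:
    """Email only when the alarm is enabled, the run is live, and something is new."""
    return enabled and not dry_run and bool(new_keys)
```

Keeping the email decision separate from reconciliation mirrors the "silent" behaviour described above: event state is updated whether or not mail is sent.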

## Adding a New Alarm

1. Add `alarms/swf_alarms/alarms/<name>.py`.
2. Expose a `PARAMS` dict and `detect(client, params)`.
3. Yield `Detection(...)` objects, using the `Detection` class from
   `swf_alarms.common`.
4. Put shared helper code under `swf_alarms/common/`.
5. Add a corresponding `Entry(kind='alarm', context='swf-alarms')` with
   `data.entry_id='alarm_<name>'`.
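
A new alarm module following steps 1 to 3 might look roughly like the skeleton below. Treat it as a sketch under assumptions: the `Detection` dataclass is a local stand-in so the example is self-contained (a real module would import it from `swf_alarms.common`, and its fields may differ), and the `max_queue_depth` parameter and `client.queue_depths()` method are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Detection:
    """Stand-in for swf_alarms.common.Detection; the real fields may differ."""

    dedupe_key: str
    summary: str
    context: dict[str, Any] = field(default_factory=dict)


# Default parameters for this hypothetical alarm.
PARAMS = {"max_queue_depth": 100}


def detect(client: Any, params: dict[str, Any]) -> Iterator[Detection]:
    """Yield one Detection per queue whose depth exceeds the threshold.

    `client.queue_depths()` is an invented method used only for illustration.
    """
    limit = params.get("max_queue_depth", PARAMS["max_queue_depth"])
    for queue, depth in sorted(client.queue_depths().items()):
        if depth > limit:
            # A stable key per queue lets repeated ticks update the same event.
            yield Detection(
                dedupe_key=f"queue_depth:{queue}",
                summary=f"{queue} depth {depth} exceeds {limit}",
            )
```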

The engine dispatches by module name. There is no central registry.
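
Dispatch by module name, with no registry, can be done with the standard library's `importlib`. The helper below is a minimal sketch and not the engine's actual loader: `load_alarm_module` is a hypothetical name, and how `<name>` is derived from `data.entry_id` is deliberately left out.

```python
import importlib
from types import ModuleType


def load_alarm_module(name: str, package: str = "swf_alarms.alarms") -> ModuleType:
    """Import `<package>.<name>` and check that it exposes a callable detect()."""
    module = importlib.import_module(f"{package}.{name}")
    if not callable(getattr(module, "detect", None)):
        raise AttributeError(f"{module.__name__} has no callable detect()")
    return module
```

With this shape, adding an alarm requires only a new module file and a matching alarm entry; nothing else has to be registered.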

## swf-remote Boundary

`swf-remote` no longer owns alarm runtime or copied alarm navigation. It
preserves monitor-rendered production navigation and replaces only the local
auth block. `/prod/alarms/...` is a proxy to monitor.

Old `swf-remote` alarm files are intentionally retained for rollback/reference.
Do not delete them as part of routine monitor-side alarm work.