# System Status

The System Status page is the monitor's cached health view for production
infrastructure. Its scope is intentionally broader than PanDA, but it is routed
under the working production path so that it is visible on both the internal
monitor and the devcloud proxy.

| Surface | URL |
|---|---|
| Internal | `https://pandaserver02.sdcc.bnl.gov/swf-monitor/panda/system/` |
| External | `https://epic-devcloud.org/prod/panda/system/` |
| Internal JSON | `https://pandaserver02.sdcc.bnl.gov/swf-monitor/panda/system/status.json` |
| External JSON | `https://epic-devcloud.org/prod/panda/system/status.json` |

## Design rule

The web tier does **not** probe services on page load. The page and the nav read
the cached database state. The production ops agent is responsible for keeping
that state fresh.

This preserves the same boundary used elsewhere in epicprod:

- **Ops agent** performs active checks and writes results.
- **Django web tier** reads cached rows and renders them.
- **Browser nav** polls the cached JSON endpoint, not the underlying services.

## Data model

Current status lives in `monitor_app.SystemStatus`
(`swf_system_status`). Historical observations live in
`monitor_app.SystemStatusHistory` (`swf_system_status_history`).

Both tables include a JSON `data` field so collectors can add structured
evidence without another schema migration. Current rows are keyed by a stable
collector name, one row per collector; history rows are append-only
observations kept for later incident review.

`SystemStatus` core fields:

- `name`: stable collector key, e.g. `epicprod-ops-agent`
- `category`: display group, e.g. `agents`, `services`, `external`
- `status`: `ok`, `warning`, `error`, or `unknown`
- `summary`: short operator-facing explanation
- `data`: JSON evidence from the collector
- `checked_at`: when the collector produced this observation

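The split between a keyed current table and an append-only history can be sketched in plain Python. This is only an illustration of the documented fields and write pattern, not the Django models themselves: `StatusRow`, `StatusStore`, and `record` are hypothetical names, and the real models may carry additional fields.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

VALID_STATUSES = {"ok", "warning", "error", "unknown"}


@dataclass
class StatusRow:
    """Plain-Python mirror of the documented SystemStatus core fields."""
    name: str
    category: str
    status: str
    summary: str
    checked_at: datetime
    data: dict[str, Any] = field(default_factory=dict)


class StatusStore:
    """In-memory stand-in for the two tables (hypothetical helper)."""

    def __init__(self) -> None:
        self.current: dict[str, StatusRow] = {}  # one row per collector name
        self.history: list[StatusRow] = []       # append-only observations

    def record(self, row: StatusRow) -> None:
        if row.status not in VALID_STATUSES:
            raise ValueError(f"unexpected status: {row.status!r}")
        self.current[row.name] = row  # replace the collector's current row
        self.history.append(row)      # history is only ever appended to


store = StatusStore()
now = datetime.now(timezone.utc)
store.record(StatusRow("httpd", "services", "ok", "active", now))
store.record(StatusRow("httpd", "services", "error", "inactive", now,
                       {"systemd": "failed"}))
```

After the two calls, `store.current` holds a single `httpd` row with the latest state, while `store.history` keeps both observations.
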
## Collectors

The initial collector set is defined in `monitor_app/system_status.py`:

| Collector | Category | Meaning |
|---|---|---|
| `epicprod-ops-agent` | `agents` | systemd state plus monitor heartbeat row |
| `swf-panda-bot` | `agents` | systemd state plus monitor heartbeat row |
| `swf-monitor-mcp-asgi` | `services` | systemd state |
| `httpd` | `services` | systemd state |
| `epic-devcloud-prod` | `external` | HTTP check of `https://epic-devcloud.org/prod/` |
| `epic-devcloud-doc` | `external` | HTTP check of `https://epic-devcloud.org/doc/` |

The `external` category is rendered as **Public Web Services** in the UI.

## Refresh mechanism

The ops agent handles `msg_type=refresh_system_status` on
`/queue/epicprod.ops`. It delegates to the standalone doer:

```bash
scripts/refresh-system-status.py --source ops_agent_periodic
```

This is deliberately **not** a Django management command. The same doer is used
for manual refreshes and periodic refreshes.

The agent starts a periodic refresh loop, controlled by these settings:

- `EPICPROD_SYSTEM_STATUS_INTERVAL`, default `300` seconds
- `EPICPROD_SYSTEM_STATUS_INITIAL_DELAY`, default `30` seconds
- `EPICPROD_SYSTEM_STATUS_TIMEOUT`, default `60` seconds

Manual refresh from the page posts to `panda/system/refresh/`, which queues the
same `refresh_system_status` message to the ops agent. The checks do not run
inside the Apache request.

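As a rough illustration of the three refresh-loop settings and their defaults, the sketch below reads them with a fallback. It assumes the settings are environment variables holding whole seconds; `load_refresh_settings` is a hypothetical helper, and the agent's actual parsing and validation may differ.

```python
import os
from typing import Mapping

# Defaults as documented for the periodic refresh loop, in seconds.
DEFAULTS = {
    "EPICPROD_SYSTEM_STATUS_INTERVAL": 300,
    "EPICPROD_SYSTEM_STATUS_INITIAL_DELAY": 30,
    "EPICPROD_SYSTEM_STATUS_TIMEOUT": 60,
}


def load_refresh_settings(env: Mapping[str, str] = os.environ) -> dict[str, int]:
    """Return each setting as an int, falling back to the documented default.

    Assumption: values come from the process environment as whole seconds.
    """
    settings = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key, "").strip()
        settings[key] = int(raw) if raw else default
    return settings


# Override only the interval; the other two keep their defaults.
settings = load_refresh_settings({"EPICPROD_SYSTEM_STATUS_INTERVAL": "120"})
```

Passing a mapping explicitly, as in the last line, makes it possible to try out overrides without touching the real environment.
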
## Overall status

`status_summary()` derives the aggregate state from cached current rows:

- `error` if any current row is `error`
- `error` if the latest cached check is older than 15 minutes
- `warning` if any row is `warning` or `unknown`
- `ok` when all current checks are OK and fresh
- `unknown` before any rows exist

The stale rule is important: if the ops agent stops refreshing, the System menu
must eventually turn red even if the last individual checks were green.

The stale threshold is the `STATUS_STALE_AFTER` constant in
`monitor_app/system_status.py`, currently 15 minutes. That is three missed
cycles at the default 5-minute ops-agent refresh interval. Tune this constant if
the nav produces false stale alarms or reacts too slowly to a dead refresher.

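The aggregation rules can be sketched as a small pure function. This is not the code of `status_summary()`: `aggregate_status` is a hypothetical name, rows are reduced to `(status, checked_at)` pairs, and the sketch assumes the empty case is checked first and the remaining rules apply in the order listed.

```python
from datetime import datetime, timedelta, timezone

STATUS_STALE_AFTER = timedelta(minutes=15)  # name and value as documented


def aggregate_status(rows, now):
    """rows: iterable of (status, checked_at) pairs from the current table."""
    rows = list(rows)
    if not rows:
        return "unknown"  # nothing has been recorded yet
    statuses = [status for status, _ in rows]
    latest = max(checked_at for _, checked_at in rows)
    if "error" in statuses:
        return "error"
    if now - latest > STATUS_STALE_AFTER:
        return "error"  # the refresher appears to have stopped
    if "warning" in statuses or "unknown" in statuses:
        return "warning"
    return "ok"


now = datetime(2026, 6, 17, 22, 40, tzinfo=timezone.utc)
fresh = now - timedelta(minutes=2)
stale = now - timedelta(minutes=20)

all_green = aggregate_status([("ok", fresh), ("ok", fresh)], now)
green_but_stale = aggregate_status([("ok", stale), ("ok", stale)], now)
```

Here `all_green` is `"ok"`, while `green_but_stale` is `"error"` even though every individual row is `ok`, which is the behaviour the stale rule is meant to guarantee.
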
## Navigation indicator

The production nav `System` item turns red when the aggregate status is `error`.

The initial page render gets the aggregate state from the global context
processor `monitor_app.context_processors.system_status_nav`. While a browser
page remains open, base-template JavaScript polls:

```text
panda/system/status.json
```

once per minute. The endpoint reads only cached database state. On devcloud the
same reversed URL is served as:

```text
/prod/panda/system/status.json
```

The browser also applies the 15-minute stale rule locally between JSON polls.

## UI conventions

- Tables size to content instead of stretching across the full window.
- Status cells use the existing BigMon-style filled state classes
  (`ok_fill`, `warning_fill`, `error_fill`, `unknown_fill`).
- URLs in summaries and JSON evidence are clickable.
- The header shows dynamic time since the latest cached check.

## Operational checks

Quick health checks after deploy:

```bash
curl -sS https://epic-devcloud.org/prod/panda/system/status.json | python3 -m json.tool
curl -sS -H 'Host: pandaserver02.sdcc.bnl.gov' \
  http://127.0.0.1/swf-monitor/panda/system/status.json | python3 -m json.tool
systemctl is-active epicprod-ops-agent swf-panda-bot swf-monitor-mcp-asgi httpd
```

Expected healthy JSON shape:

```json
{
  "overall_status": "ok",
  "overall_reason": "All current checks are OK.",
  "latest_checked_at": "2026-06-17T22:37:27.720575+00:00",
  "counts": {
    "ok": 6,
    "warning": 0,
    "error": 0,
    "unknown": 0,
    "total": 6
  }
}
```
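
For scripted post-deploy checks, the healthy shape above can be verified with a short helper. This is a minimal sketch: `payload_problems` is a hypothetical function, and the check that the four per-status counts add up to `total` is an assumption inferred from the sample rather than a documented guarantee.

```python
import json

STATUS_KEYS = ("ok", "warning", "error", "unknown")


def payload_problems(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks healthy."""
    problems = []
    overall = payload.get("overall_status")
    if overall != "ok":
        problems.append(f"overall_status is {overall!r}")
    counts = payload.get("counts", {})
    missing = [key for key in STATUS_KEYS + ("total",) if key not in counts]
    if missing:
        problems.append(f"counts is missing keys: {missing}")
    elif sum(counts[key] for key in STATUS_KEYS) != counts["total"]:
        # Assumption: total is the sum of the four per-status counts.
        problems.append("counts do not add up to total")
    return problems


# Sample payload copied from the expected healthy shape.
sample = json.loads("""
{
  "overall_status": "ok",
  "overall_reason": "All current checks are OK.",
  "latest_checked_at": "2026-06-17T22:37:27.720575+00:00",
  "counts": {"ok": 6, "warning": 0, "error": 0, "unknown": 0, "total": 6}
}
""")

problems = payload_problems(sample)
```

For the sample payload `problems` is empty. In practice the dictionary would come from parsing the output of one of the `curl` commands instead of the inline sample.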