# System Status

The System status page is the monitor's cached health view for production
infrastructure. It is intentionally broader than PanDA, but it is routed under
the working production path so it is visible on both the internal monitor and
the devcloud proxy.

| Surface | URL |
|---|---|
| Internal | `https://pandaserver02.sdcc.bnl.gov/swf-monitor/panda/system/` |
| External | `https://epic-devcloud.org/prod/panda/system/` |
| Internal JSON | `https://pandaserver02.sdcc.bnl.gov/swf-monitor/panda/system/status.json` |
| External JSON | `https://epic-devcloud.org/prod/panda/system/status.json` |

## Design rule

The web tier does **not** probe services on page load. The page and the nav read
the cached database state. The production ops agent is responsible for keeping
that state fresh.

This preserves the same boundary used elsewhere in epicprod:

- **Ops agent** performs active checks and writes results.
- **Django web tier** reads cached rows and renders them.
- **Browser nav** polls the cached JSON endpoint, not the underlying services.

## Data model

Current status lives in `monitor_app.SystemStatus`
(`swf_system_status`). Historical observations live in
`monitor_app.SystemStatusHistory` (`swf_system_status_history`).

Both tables include a JSON `data` field so collectors can add structured
evidence without another schema migration. Current rows are keyed by stable
collector name; history rows are append-only observations used for later
incident review.

`SystemStatus` core fields:

- `name`: stable collector key, e.g. `epicprod-ops-agent`
- `category`: display group, e.g. `agents`, `services`, `external`
- `status`: `ok`, `warning`, `error`, or `unknown`
- `summary`: short operator-facing explanation
- `data`: JSON evidence from the collector
- `checked_at`: when the collector produced this observation
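
The split between a keyed current row and an append-only history can be sketched in plain Python. This is an illustration only, not the Django models themselves: `StatusRow`, `StatusStore`, and `record()` are hypothetical names, and an in-memory dict and list stand in for the two tables.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List

VALID_STATUSES = {"ok", "warning", "error", "unknown"}


@dataclass
class StatusRow:
    """Mirrors the documented core fields (hypothetical stand-in for the model)."""
    name: str
    category: str
    status: str
    summary: str
    data: Dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unsupported status: {self.status!r}")


class StatusStore:
    """In-memory stand-in for the current and history tables."""

    def __init__(self) -> None:
        self.current: Dict[str, StatusRow] = {}  # keyed by stable collector name
        self.history: List[StatusRow] = []       # append-only observations

    def record(self, row: StatusRow) -> None:
        self.current[row.name] = row  # the latest observation replaces the current row
        self.history.append(row)      # every observation is kept for incident review


store = StatusStore()
store.record(StatusRow("httpd", "services", "ok", "active"))
store.record(StatusRow("httpd", "services", "error", "inactive", {"state": "inactive"}))
```

After the two `record()` calls the store holds one current row for `httpd` and two history rows, which is the shape the two tables are described as having.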

## Collectors

The initial collector set is defined in `monitor_app/system_status.py`:

| Collector | Category | Meaning |
|---|---|---|
| `epicprod-ops-agent` | `agents` | systemd state plus monitor heartbeat row |
| `swf-panda-bot` | `agents` | systemd state plus monitor heartbeat row |
| `swf-monitor-mcp-asgi` | `services` | systemd state |
| `httpd` | `services` | systemd state |
| `epic-devcloud-prod` | `external` | HTTP check of `https://epic-devcloud.org/prod/` |
| `epic-devcloud-doc` | `external` | HTTP check of `https://epic-devcloud.org/doc/` |

The `external` category is rendered as **Public Web Services** in the UI.
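
As a rough illustration of how the collector table maps onto display groups, the sketch below groups collector names by category and applies the one display label stated above. The names `COLLECTORS`, `CATEGORY_LABELS`, and `group_for_display` are hypothetical, and the fallback of showing the raw category name for `agents` and `services` is an assumption, not the actual contents of `monitor_app/system_status.py`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (collector name, category) pairs copied from the table above.
COLLECTORS: List[Tuple[str, str]] = [
    ("epicprod-ops-agent", "agents"),
    ("swf-panda-bot", "agents"),
    ("swf-monitor-mcp-asgi", "services"),
    ("httpd", "services"),
    ("epic-devcloud-prod", "external"),
    ("epic-devcloud-doc", "external"),
]

# Only the `external` label is documented; other categories fall back
# to their raw name (assumption).
CATEGORY_LABELS: Dict[str, str] = {"external": "Public Web Services"}


def group_for_display(collectors: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group collector names under the label the UI would show."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, category in collectors:
        groups[CATEGORY_LABELS.get(category, category)].append(name)
    return dict(groups)


groups = group_for_display(COLLECTORS)
```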

## Refresh mechanism

The ops agent handles `msg_type=refresh_system_status` on
`/queue/epicprod.ops`. It delegates to the standalone doer:

```bash
scripts/refresh-system-status.py --source ops_agent_periodic
```

This is deliberately **not** a Django management command. The same doer is used
for manual refreshes and periodic refreshes.

The agent starts a periodic refresh loop:

- `EPICPROD_SYSTEM_STATUS_INTERVAL`, default `300` seconds
- `EPICPROD_SYSTEM_STATUS_INITIAL_DELAY`, default `30` seconds
- `EPICPROD_SYSTEM_STATUS_TIMEOUT`, default `60` seconds
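
A minimal sketch of reading these three settings with their documented defaults follows. The variable names and defaults come from the list above; `RefreshSettings` and `load_refresh_settings` are hypothetical helpers, and parsing the values as floats is an assumption about how the agent treats them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RefreshSettings:
    interval: float       # seconds between periodic refreshes
    initial_delay: float  # seconds to wait after agent start
    timeout: float        # seconds allowed for one refresh run


def load_refresh_settings(env: Optional[Mapping[str, str]] = None) -> RefreshSettings:
    """Read the refresh settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return RefreshSettings(
        interval=float(env.get("EPICPROD_SYSTEM_STATUS_INTERVAL", "300")),
        initial_delay=float(env.get("EPICPROD_SYSTEM_STATUS_INITIAL_DELAY", "30")),
        timeout=float(env.get("EPICPROD_SYSTEM_STATUS_TIMEOUT", "60")),
    )


defaults = load_refresh_settings({})
faster = load_refresh_settings({"EPICPROD_SYSTEM_STATUS_INTERVAL": "120"})
```

Passing an explicit mapping instead of relying on `os.environ` keeps the helper easy to exercise in tests.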

Manual refresh from the page posts to `panda/system/refresh/`, which queues the
same `refresh_system_status` message to the ops agent. It does not run checks in
the Apache request.
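
The hand-off from queue message to doer might look roughly like the sketch below. Only the `msg_type` value, the script path, and the `--source` flag come from this document. The `source` key in the message body, the `manual` default, and the `command_for_message` helper are assumptions made for illustration.

```python
from typing import Any, List, Mapping, Optional

REFRESH_MSG_TYPE = "refresh_system_status"
DOER = "scripts/refresh-system-status.py"


def command_for_message(message: Mapping[str, Any]) -> Optional[List[str]]:
    """Return the doer argv for a refresh message, or None for other messages."""
    if message.get("msg_type") != REFRESH_MSG_TYPE:
        return None
    # Hypothetical: the message carries a `source` label that is passed through.
    source = str(message.get("source", "manual"))
    return [DOER, "--source", source]


periodic = command_for_message(
    {"msg_type": "refresh_system_status", "source": "ops_agent_periodic"}
)
ignored = command_for_message({"msg_type": "something_else"})
```

Building an argv list rather than a shell string means the agent could hand it to `subprocess.run` without shell quoting concerns, and manual and periodic refreshes would run the same doer.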

## Overall status

`status_summary()` derives the aggregate state from cached current rows:

- `error` if any current row is `error`
- `error` if the latest cached check is older than 15 minutes
- `warning` if any row is `warning` or `unknown`
- `ok` when all current checks are OK and fresh
- `unknown` before any rows exist

The stale rule is important: if the ops agent stops refreshing, the System menu
must eventually turn red even if the last individual checks were green.

The stale threshold is the `STATUS_STALE_AFTER` constant in
`monitor_app/system_status.py`, currently 15 minutes. That is three missed
cycles at the default 5-minute ops-agent refresh interval. Tune this constant if
the nav produces false stale alarms or reacts too slowly to a dead refresher.
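
The rules above can be sketched as a small pure function. This is not the real `status_summary()`: the function name `summarize`, the `(status, checked_at)` input shape, and the evaluation order (empty first, then the bullets top to bottom) are assumptions. Only the rules themselves and the 15-minute `STATUS_STALE_AFTER` value come from this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

STATUS_STALE_AFTER = timedelta(minutes=15)


def summarize(
    rows: Iterable[Tuple[str, datetime]], now: Optional[datetime] = None
) -> str:
    """Derive an aggregate status from (status, checked_at) pairs."""
    now = now or datetime.now(timezone.utc)
    rows = list(rows)
    if not rows:
        return "unknown"  # nothing has been recorded yet
    statuses = {status for status, _ in rows}
    latest = max(checked_at for _, checked_at in rows)
    if "error" in statuses:
        return "error"
    if now - latest > STATUS_STALE_AFTER:
        return "error"  # the refresher stopped: stale green must not stay green
    if statuses & {"warning", "unknown"}:
        return "warning"
    return "ok"


now = datetime(2026, 6, 17, 22, 40, tzinfo=timezone.utc)
fresh = now - timedelta(minutes=2)
stale = now - timedelta(minutes=20)
```

With these timestamps, `summarize([("ok", fresh)], now)` yields `ok`, while `summarize([("ok", stale)], now)` yields `error` because the only observation is 20 minutes old.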

## Navigation indicator

The production nav `System` item turns red when the aggregate status is `error`.

Initial page render gets aggregate state through the global context processor
`monitor_app.context_processors.system_status_nav`. While a browser page remains
open, base-template JavaScript polls:

```text
panda/system/status.json
```

once per minute. The endpoint reads only cached database state. On devcloud the
same reversed URL is served as:

```text
/prod/panda/system/status.json
```

The browser also applies the 15-minute stale rule locally between JSON polls.

## UI conventions

- Tables size to content instead of stretching across the full window.
- Status cells use the existing BigMon-style filled state classes
  (`ok_fill`, `warning_fill`, `error_fill`, `unknown_fill`).
- URLs in summaries and JSON evidence are clickable.
- The header shows dynamic time since the latest cached check.

## Operational checks

Quick health checks after deploy:

```bash
curl -sS https://epic-devcloud.org/prod/panda/system/status.json | python3 -m json.tool
curl -sS -H 'Host: pandaserver02.sdcc.bnl.gov' \
  http://127.0.0.1/swf-monitor/panda/system/status.json | python3 -m json.tool
systemctl is-active epicprod-ops-agent swf-panda-bot swf-monitor-mcp-asgi httpd
```

Expected healthy JSON shape:

```json
{
  "overall_status": "ok",
  "overall_reason": "All current checks are OK.",
  "latest_checked_at": "2026-06-17T22:37:27.720575+00:00",
  "counts": {
    "ok": 6,
    "warning": 0,
    "error": 0,
    "unknown": 0,
    "total": 6
  }
}
```
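
A post-deploy script could sanity-check that payload along the following lines. This is a sketch: `check_payload` is a hypothetical helper, and the rule that the per-status counts add up to `total` is inferred from the sample above rather than from a documented contract.

```python
import json
from typing import Any, Dict, List

STATUSES = ("ok", "warning", "error", "unknown")

SAMPLE = """
{
  "overall_status": "ok",
  "overall_reason": "All current checks are OK.",
  "latest_checked_at": "2026-06-17T22:37:27.720575+00:00",
  "counts": {"ok": 6, "warning": 0, "error": 0, "unknown": 0, "total": 6}
}
"""


def check_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a status.json payload (empty if none)."""
    problems: List[str] = []
    if payload.get("overall_status") not in STATUSES:
        problems.append("overall_status is missing or not a known status")
    counts = payload.get("counts") or {}
    per_status = sum(counts.get(key, 0) for key in STATUSES)
    if per_status != counts.get("total"):
        problems.append("per-status counts do not add up to total")
    return problems


problems = check_payload(json.loads(SAMPLE))
```

An empty `problems` list means the payload has the expected shape; it says nothing about whether the underlying services are healthy, which is what `overall_status` reports.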