source/sphenixprod/dashboard_design.md

0001 # Production Monitoring Dashboard — Design Notes
0002
0003 ## Goal
0004
0005 A Grafana dashboard for monitoring ongoing sPHENIX production campaigns, showing job progress, resource usage, and failure rates. The primary audience is production curators who need a live view of what is running and how healthy it is, as well as users who just need a progress view.
0006
0007 ---
0008
0009 ## Fundamental splits
0010
0011 ### Tracking vs. calo
0012
0013 There is a fundamental physics-motivated separation between tracking and calorimeter productions. A single `tag` (campaign identifier) can and does contain both tracking and calo jobs, so the split is a property of the **(tag, production_type) pair**, not the tag alone.
0014
0015 The production type is inferred unambiguously from `dsttype`:
0016
0017 | production_type | dsttype patterns |
0018 |---|---|
0019 | `tracking` | `DST_STREAMING_EVENT%`, `DST_TRKR_%` |
0020 | `calo` | `DST_TRIGGERED_EVENT%`, `DST_CALOFITTING`, `DST_CALO`, `DST_JETS`, `DST_JETCALO` |
0021
0022 Source of truth for all current dsttypes: [sphenixjobdicts.py](sphenixjobdicts.py).
0023
0024 ### Dashboard structure
0025
0026 One dashboard with a `$prod_type` dropdown variable (`tracking` / `calo`) rather than two separate dashboards. A second dashboard would mean every panel fix has to be applied twice.
0027
0028 ---
0029
0030 ## Dashboard layout
0031
0032 ### Row 1 — Campaign health at a glance (stat panels)
0033 - Job counts by status: `submitting`, `submitted`, `started`, `running`, `held`, `evicted`, `failed`, `finished` — color-coded tiles
0034 - Overall completion % — `finished / maxjobsexpected` per tag
0035 - Active jobs right now — `status IN ('started', 'running')`
0036 - Failure rate (%) in a rolling time window
0037
0038 ### Row 2 — Progress by rule/dataset (table panel)
0039 The main "where are we?" view. One row per `(tag, dsttype, dataset)` with columns: expected, submitted, running, finished, failed. The `finished` cell scales green toward 100%; the `failed` cell scales red. Analogous to the Run Table in the sPHENIX Run Database dashboard.
0040
0041 ### Row 3 — Performance and resource trends (time series)
0042 - Jobs finishing per hour (throughput)
0043 - `MemoryUsage` vs `MemoryProvisioned` — reveals how often memory-escalation retries are happening
0044 - Average `RemoteUserCpu` per `dsttype`
0045 - `jobstarts > 1` trend — jobs that restarted, proxy for cluster instability
0046
0047 ---
0048
0049 ## Dashboard variables (Grafana)
0050
0051 | variable | type | query |
0052 |---|---|---|
0053 | `$prod_type` | static dropdown | `tracking`, `calo` |
0054 | `$tag` | dynamic query | `SELECT tag FROM production_overview WHERE active AND production_type = '${prod_type}' ORDER BY priority` |
0055
0056 Every panel filters on `tag = ANY($tag)` and uses a `dsttype`-based predicate to enforce the production_type boundary (so a calo query on tag X does not accidentally pull tracking jobs also carrying tag X).
0057
0058 ---
0059
0060 ## The `production_overview` table
0061
0062 ### Purpose
0063
0064 Acts as the **control plane** for the dashboard. Curators flip `active = true` to make a campaign visible. New `(tag, production_type)` pairs are auto-registered by a Postgres trigger the moment the first matching job is inserted into `production_jobs`, so curators never have to remember to create entries manually.
0065
0066 ### Schema
0067
0068 ```sql
0069 CREATE TABLE IF NOT EXISTS production_overview (
0070     id               SERIAL                   PRIMARY KEY,
0071     tag              TEXT                     NOT NULL,
0072     production_type  TEXT                     NOT NULL,
0073     active           BOOLEAN                  NOT NULL DEFAULT false,
0074     display_name     TEXT,
0075     priority         INT                      NOT NULL DEFAULT 0,
0076     started_at       TIMESTAMP WITH TIME ZONE,
0077     expected_end     TIMESTAMP WITH TIME ZONE,
0078     notes            TEXT,
0079     registered_at    TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
0080     UNIQUE (tag, production_type),
0081     CHECK (production_type IN ('tracking', 'calo'))
0082 );
0083 ```
0084
0085 ### Column notes
0086
0087 - **`active`** — defaults `false`; nothing appears on the dashboard until a curator explicitly enables it.
0088 - **`display_name`** — optional human-friendly label for panel titles (e.g. `"Run 3 Au+Au pass-1"`).
0089 - **`priority`** — controls display order; lower numbers appear first. Useful when multiple campaigns are active simultaneously.
0090 - **`started_at` / `expected_end`** — curator-supplied timestamps for ETA and progress-bar panels.
0091 - **`notes`** — free-text annotation surfaced as a dashboard tooltip. Intended for things like known issues, hold reasons, or links to logbooks.
0092 - **`registered_at`** — set automatically on first job insert; gives curators context on when a campaign arrived without needing to query `production_jobs`.
0093 - **`CHECK (production_type IN (...))`** — enforces the allowed values at the DB level independently of the trigger, so a direct INSERT by a curator cannot introduce an invalid type. Easy to extend when a third type appears.
0094
0095 ### Auto-registration trigger
0096
0097 ```sql
0098 CREATE OR REPLACE FUNCTION auto_register_production_overview()
0099 RETURNS TRIGGER AS $$
0100 DECLARE
0101     v_prod_type TEXT;
0102 BEGIN
0103     v_prod_type := CASE
0104         WHEN NEW.dsttype LIKE 'DST_STREAMING_EVENT%' OR NEW.dsttype LIKE 'DST_TRKR_%'
0105             THEN 'tracking'
0106         WHEN NEW.dsttype LIKE 'DST_TRIGGERED_EVENT%'
0107           OR NEW.dsttype IN ('DST_CALOFITTING', 'DST_CALO', 'DST_JETS', 'DST_JETCALO')
0108             THEN 'calo'
0109         ELSE NULL
0110     END;
0111
0112     IF v_prod_type IS NULL THEN
0113         RAISE NOTICE
0114             'auto_register_production_overview: unrecognised dsttype "%" (tag=%, rulename=%) — skipping registration',
0115             NEW.dsttype, NEW.tag, NEW.rulename;
0116         RETURN NEW;
0117     END IF;
0118
0119     INSERT INTO production_overview (tag, production_type)
0120     VALUES (NEW.tag, v_prod_type)
0121     ON CONFLICT (tag, production_type) DO NOTHING;
0122
0123     RETURN NEW;
0124 END;
0125 $$ LANGUAGE plpgsql;
0126
0127 CREATE OR REPLACE TRIGGER trg_auto_register_overview
0128     AFTER INSERT ON production_jobs
0129     FOR EACH ROW EXECUTE FUNCTION auto_register_production_overview();
0130 ```
0131
0132 **`RAISE NOTICE` on unknown dsttype** — rather than silently skipping or aborting the job insert, an unknown dsttype emits a Postgres `NOTICE`. This is visible in server logs and to any client with `\set VERBOSITY verbose`. The intent is to catch new dsttypes early (e.g. `DST_TRKR_MVTXME` is already in `sphenixjobdicts.py` and would currently fall through) without ever blocking job submission.
0133
0134 **`CREATE OR REPLACE TRIGGER`** requires Postgres 14+. On older versions, replace with `DROP TRIGGER IF EXISTS trg_auto_register_overview ON production_jobs; CREATE TRIGGER ...`.
0135
0136 ---
0137
0138 ## Open questions / next steps
0139
0140 - **Panel queries** — write the SQL behind each Grafana panel before touching the dashboard JSON.
0141 - **`DST_TRKR_MVTXME`** — currently falls through the trigger's `CASE` (not in the tracking pattern). Decide if it belongs to `tracking` and add it.
0142 - **Unknown-dsttype alerting** — the `RAISE NOTICE` is passive. Consider a small separate table `production_unknown_dsttypes` that the trigger also inserts into, making it queryable and alertable from Grafana.
0143 - **Curator tooling** — a minimal admin script or SQL snippet for the common curator operations: activate a campaign, set display name, set priority, add a note.