swf-monitor/docs/EPICPROD_DATA_LINEAGE.md

0001 # ePIC Production Data Lineage
0002
0003 A production system must record and reference the data it produces. epicprod
0004 produces reconstruction datasets that land in Rucio; PCS already holds their full
0005 production provenance — physics, EvGen, simulation, reconstruction tags,
0006 campaign, requestor, use flags, status — with rich filtering over all of it. What
0007 it does not yet hold is the **explicit Rucio reference** for each produced
0008 dataset: the `epic:/RECO/…` DID(s), their RSEs, and file counts. Recording those
0009 closes the production record — and, the provenance already being filterable, lets
0010 the catalog locate produced data by any provenance dimension.
0011
0012 The immediate application is verifying campaign **lineage and completeness** —
0013 the record the production team keeps today by hand (the `default_datasets`
0014 catalogue and monthly completion email).
0015
0016 ## Problem and scope
0017
0018 The requested-side information is already in the PCS task database —
0019 `default_datasets` seeded our catalog and everything it shows is in our
0020 `ProdTask` records. What is missing, in our catalog *and* on the
0021 `default_datasets` HTML page, is the explicit link to Rucio. Without it a filter
0022 cannot yet resolve to Rucio datasets or files; with it, it can.
0023
0024 The work is to **gather** the Rucio links and write them onto the existing task
0025 records' extensible JSON — not a second database of mappings. Stored there, the
0026 links unlock two capabilities over the catalog's existing filters:
0027
0028 - **Reference** — resolve a filtered task set to its produced Rucio
0029   datasets/files. A plain catalog read, no credential.
0030 - **Access** — fetch the produced data itself over XRootD from those links:
0031   sample a file, pull a whole file when small (as the payload-log doer does for
0032   logs), examine it. The agent's credentialed xrootd substrate; no credential in
0033   the web tier.
0034
0035 So we win not just the correlation but **data access**.
0036
0037 Gathering is **transitional, and specific to pre-PanDA data**. The current
0038 (Condor-based) production system does not make or preserve the production→Rucio
0039 connection in flight, which is exactly why it must be reconstructed after the
0040 fact — and why the production team's monthly completion emails exist. PanDA
0041 records that connection in flight, so PanDA-produced data carries it from the
0042 start. The sweep therefore backfills pre-PanDA campaigns only; reference and
0043 access over the recorded links are permanent.
0044
0045 ## Derivation (request filters → produced DID → replicas)
0046
0047 *(Implemented — `match_requests_to_rucio_snapshot`, `_filter_match`,
0048 `import_jlab_rucio_current_snapshot`, `refresh_rucio_snapshots`, and
0049 `_rucio_match_to_output` in `pcs/services.py`.)*
0050
0051 The implemented mechanism is a **filter-based match over a fetched Rucio
0052 snapshot**, not a path-string glob. A campaign's full Rucio listing is fetched
0053 once into a snapshot, then each `ProdTask` is matched to the produced datasets
0054 in it by comparing semantic filter fields. Matching on path strings does not
0055 work: a produced RECO DID carries extra segments (generator, radiation, charge)
0056 and a different Q² spelling than the requested EVGEN path.
0057
0058 **Snapshot fetch — Rucio → JSON:**
0059
0060 - `import_jlab_rucio_current_snapshot` authenticates to JLab Rucio and calls
0061   `fetch_jlab_rucio_campaign` for both `/RECO/<campaign>` and `/FULL/<campaign>`.
0062   Each fetch does a `/dids/<scope>/dids/search` for `<campaign_path>/*`, then a
0063   per-dataset metadata + `/replicas/.../datasets` read (run in a thread pool to
0064   stay under the request timeout), yielding for each produced dataset its `did`,
0065   `length`, `bytes`, and per-RSE `rse_replicas`.
0066 - The snapshot is written to one JSON file per campaign under the snapshot
0067   directory (`current-<campaign>.json`) and reused for the match.
0068
0069 **Match — request filters → produced DID(s):**
0070
0071 - `match_requests_to_rucio_snapshot` indexes the snapshot's datasets and extracts
0072   the filter axes of each produced DID. For every `ProdTask` in the campaign it
0073   reads the request filter block (the persisted `csv_import.filters`, or a fresh
0074   extract from the CSV input path) and compares the two via `_filter_match` on the
0075   shared semantic axes — detector, beam, physics, Q² overlap, and (for
0076   single-particle paths) species/energy — never on path strings.
0077 - Each matching produced dataset is converted by `_rucio_match_to_output` into an
0078   `overrides.outputs` entry carrying `did`, `stage`, `version`, derived
0079   `filters`, per-RSE `rses` (`files`/`total`/`complete`), aggregate `file_count`,
0080   `bytes`, `complete`, and `checked_at`. Datasets in the snapshot that no request
0081   matched are stashed on `campaign.data['rucio_unmatched']` for the catalog to
0082   surface.
0083
0084 **Completeness — replicas, counts:**
0085
0086 - Per-RSE completeness comes from each replica's available-vs-total file count in
0087   the snapshot record; a dataset is `complete` only when every RSE replica is
0088   fully available.
0089 - File count and byte size are taken as the max across RSE replicas (Rucio
0090   reports them per replica, identical across RSEs).
0091
0092 ## PCS data model (write target)
0093
0094 The links go onto **`ProdTask.overrides`** (JSONField) under a reserved key — the
0095 same interim convention that already holds `input_dataset_dids` and the
0096 `public_catalog_*` fields. Relevant `ProdTask` fields:
0097
0098 - `input_source_location` (property; `Dataset.source_location`, i.e.
0099   `Dataset.metadata['source']['location']`, `csv_file` fallback) — the requested
0100   `/volatile/eic/EPIC/EVGEN/<suffix>` path the request filter fields are extracted
0101   from for the match.
0102 - `campaign` → `Campaign` (FK) — selects the Rucio snapshot to match against.
0103 - `dataset` → `Dataset` (FK) — the PCS *output* dataset. Its `did` is the
0104   PCS-composed `group.EIC:….b{N}` identifier and `detector_config` the detector.
0105   For **pre-PanDA Condor production** this is a different namespace from the
0106   produced `epic:/RECO/…` Rucio DID (hence the filter-based match below). PanDA
0107   production instead carries the composed identity *as* its Rucio DID (see
0108   Phase 4), so there the two coincide.
0109 - `request` → `ProdRequest` (FK) — originating PWG/DSC request; carries
0110   `nevents` (the requested event count), intended for a future
0111   expected-vs-actual completeness check. The implemented completeness is per-RSE
0112   replica availability only.
0113 - `overrides` — the interim JSON the links are written to.
0114
0115 `overrides.outputs` — a list, one entry per produced Rucio dataset
0116 (lifecycle-neutral, never aggregated); the single home for the produced-output
0117 ↔ task association, read via the `ProdTask.outputs` accessor:
0118
0119 ```json
0120 {
0121   "outputs": [
0122     {
0123       "did": "epic:/RECO/26.04.1/epic_craterlake/<suffix>",
0124       "stage": "RECO",
0125       "version": "26.04.1",
0126       "filters": {"detector": "epic_craterlake", "beam": "10x100", "physics": "DIS", "q2": "", "species": "", "energy": ""},
0127       "rses": [{"rse": "BNL-XRD", "files": 1234, "total": 1234, "complete": true}],
0128       "file_count": 1234,
0129       "bytes": 1234567890,
0130       "complete": true,
0131       "checked_at": "<iso8601>"
0132     }
0133   ]
0134 }
0135 ```
0136
0137 The same schema serves current and past campaigns — today's current is
0138 tomorrow's past, with no reshape on transition. `migrate_outputs_schema()`
0139 (standalone `scripts/migrate_outputs_schema.py`) folded the legacy `past_output`
0140 block and the old `csv_import.output` rollup onto it.
0141
0142 ## Architecture
0143
0144 **Gather** *(implemented)* — standard credentialed-async pattern, no new
0145 substrate: the catalog's **Update from Rucio** button POSTs to
0146 `prod-tasks/rucio-snapshot-update/`, which publishes a `rucio_snapshot_update`
0147 message to the prod-ops agent. The agent's `_handle_rucio_snapshot_update`
0148 dispatches a `run_in_background` doer (`_do_rucio_snapshot_update`, holds the
0149 proxy) that runs `refresh_rucio_snapshots` — fetch the JLab Rucio snapshot for the
0150 current (and last) campaign(s) and rematch produced datasets onto each task's
0151 `overrides.outputs`. On completion it publishes `rucio_snapshot_ready`, which the
0152 catalog page receives over the SSE relay (`EventSource`) and refreshes live,
0153 internally and through the swf-remote streaming proxy. The web tier holds no
0154 credential.
0155
0156 - Trigger: the **Update from Rucio** button (on demand). No nightly cron and no
0157   per-campaign Sweep button are wired today.
0158 - Unit of work: the current and last campaigns — for each, fetch the snapshot
0159   once and match the campaign's `ProdTask` rows against it; the receiver thread
0160   never blocks.
0161
0162 **Reference** *(future, building on the gathered links)* — a catalog read over the stored links, no credential: the existing
0163 filter set (`EPICPROD_TASK_CATALOG.md` §7) collects the tasks' `outputs`
0164 DIDs across the filtered set. Dataset-level reference is a pure read of the cached
0165 links; file/PFN expansion is resolved **live against Rucio on demand** — file
0166 lists are not cached, Rucio is their authority. Surfaced in the catalog's "Rucio
0167 Monitor" feed (DID link, RSE badges, file count) per `EPICPROD_TASK_CATALOG.md`
0168 §6, and as an action to list or export the produced datasets for a selection. A
0169 **prod-navigation CLI** is a second front-end on this same filter→resolve REST,
0170 covered by its own design doc (downstream of these links existing).
0171
0172 **Access** *(future)* — fetch the produced data over XRootD, the same credentialed substrate
0173 as payload-log: a `run_in_background` doer constructs the per-RSE XRootD PFN
0174 (replace the `epic:` DID prefix with the RSE prefix), `xrdcp`s a sample or a small
0175 whole file under the proxy, caches it, and pushes the result to the browser on
0176 `/topic/epictopic`. Sample and small-whole-file pull only; inspection follows the
0177 payload-log model. Each is an individual handler.
0178
0179 ## Proto-plan
0180
0181 **Phase 1 — gather.** *(Implemented.)* Fetch the campaign's Rucio snapshot once;
0182 match each `ProdTask` to the produced datasets in it on the shared filter axes
0183 (`_filter_match`); write `overrides.outputs`. Render in the catalog; push
0184 `rucio_snapshot_ready` on completion. Validate by spot-checking a sample against
0185 an independent manual Rucio query.
0186
0187 **Phase 2 — reference.** *(Future, building on the gathered links.)* Catalog
0188 filter → the produced Rucio datasets, with on-demand file/PFN expansion (live,
0189 uncached) — the system surfacing the data it produced from its provenance record.
0190 The prod-navigation CLI is a second front-end on this REST.
0191
0192 **Phase 3 — access.** *(Future.)* Fetch produced data over XRootD from the stored
0193 links — sample, small-whole-file pull, examine — reusing the payload-log xrootd
0194 doer.
0195
0196 **Phase 4 — capture at source (PanDA data).** PanDA makes the production→Rucio
0197 connection in flight, so for PanDA-produced tasks the output DID is recorded at
0198 submission time (extending `record-submission`, which already writes
0199 `panda_task_id`) rather than reconstructed by a sweep. That DID is the PCS
0200 composed identity name itself (`group.EIC:…`) — PanDA production uses our
0201 composed names throughout, so there is no separate `epic:/RECO/…` reference and
0202 no cross-namespace match. The sweep stays a backfill tool for pre-PanDA
0203 campaigns, whose legacy DIDs PCS records and presents as found.