swf-monitor/docs/PCS_DATASET_REQUEST_WORKFLOW.md

# PCS Production Planning Workflow

This note describes the PCS-based workflow for ePIC production
dataset requests. It serves both as an implementation guide and as a
summary for confirming that the design matches the production
planning workflow foreseen by the production team and taking shape in `eic/epic-prod`.

## Summary

PCS should be the authoritative record for production
requests, which take the form of
task specifications composed from dataset, tag, and production configuration specs.
The tasks are also the source of record for downstream production state as
production progresses. Mattermost/PanDAbot is a two-way conversational
interface between users and PCS. The `eic/epic-prod` issue/PR/Jekyll workflow is
the public catalogue projection of production plans as expressed in PCS data,
with PCS as the source of truth.

The current production reality starts from generator-level files supplied by
PWGs or DSCs, input as CSV file spec manifests. PCS must represent
this present reality, describing
those externally supplied inputs explicitly, while also providing a clean path to a
future mode where EVGEN is run as an internal production stage.

## Current Situation

Sakib has described and prototyped this public intake path:

1. A requester submits a dataset request through a GitHub issue template or,
   later, through Mattermost/PanDAbot as a dialog front end to issue creation.
2. A GitHub Action triggered by the issue creation appends a row to
   `docs/_data/datasets.csv` in `eic/epic-prod`.
3. A pull request is opened for review.
4. Once merged, the Jekyll-generated campaign datasets page displays
   the new request.

The public CSV currently stores fields such as:

- DSC or PWG
- Dataset Path
- Generator/Dataset Version
- Number of Events
- Background
- New Request
- Pre-TDR Use
- Early Science Use
- Other Use
- Description
- Priority
- Issue

These fields are metadata that map onto the PCS components from which
a fully or partially specified production task is composed.

At present, `Generator/Dataset Version` is a loose string,
not a structured PCS EvGen tag, and `Dataset Path` is an external data source or
manifest, not a tag attribute.

## PCS Authority

PCS is designed to be the authoritative source and record for task configuration:

- request state
- inferred and confirmed metadata
- associated physics, EvGen, simulation, and reconstruction tags
- input and output datasets
- production configuration
- workflow mode and stage structure
- validation status
- public catalogue issue, PR, row, and page references
- user comments
- downstream production status and task IDs

Definition of PCS components and their composition into tasks is currently done through a web interface.
Intake goes through a single service layer (`pcs.services`) that is the source
of truth for validation, idempotency, and lifecycle. REST and MCP are peer
surfaces over that layer: each is a thin adapter that turns wire-format input
(HTTP body or MCP args) into a service call and the service result back into
wire-format output.

- **Bots trigger; they do not mediate.** PanDAbot receives a Mattermost
  event, then calls a PCS MCP tool. Bots do not embed PCS logic. Intake
  decisions, validation, and persistence live in the service layer.
- **Web UI uses REST.** PCS web pages call PCS REST for reads and writes;
  nothing in the UI bypasses the service layer.
- **MCP and REST are peers over the same services.** Adding a new
  operation is "add the service function, expose it via REST and MCP";
  the two surfaces stay aligned because they share business logic.
- **Scripts and other automation** typically call REST (e.g. `pcs-task-cmd`
  is a stdlib HTTP client over REST). They could equally well use MCP;
  the contract is the same.
- **GitHub issue creation, when needed,** is performed by PCS server-side
  on receipt of a 'go' from the bot/user via REST or MCP. The creation is
  programmatic and precise, so traceability and review are preserved.

User-supplied comments arrive through whichever front end the user chose
(bot, web, script). PCS records them as part of the request history through
the same service layer. The bot's job is to relay PCS responses (questions,
inferred values, short option lists, validation errors) back to the user;
PCS's job is to compute them.

The intake surface (REST endpoints and the corresponding MCP tools) is
listed under "Intake Surface" below.

## Model Direction

Tags remain reusable attribute sets. They do not carry concrete file paths or
manifests. A concrete data sample is a dataset, described by one or more tags.

PCS should keep one generalized dataset model and extend it to cover externally
supplied data as well as internally produced data. The key additions are:

- stage or role, e.g. `evgen`, `simu`, `reco`, `full`, `log`
- source kind and source location, e.g. CSV manifest, path, URL, Rucio DID, file list

The present externally supplied EVGEN files are then ordinary PCS datasets with
`stage=evgen` and `source.kind=csv_manifest` or another external source kind.
They are described by PCS physics and EvGen tags, but their file/path/CSV
source is dataset metadata, not tag metadata.
Future PCS/PanDA-produced EVGEN outputs are also ordinary datasets, with
`stage=evgen` and an internal production source when that exists.

Interim implementation: externally supplied EVGEN inputs are represented in the
existing `Dataset.metadata` JSON rather than by new database columns. The
metadata convention is:

```json
{
  "stage": "evgen",
  "source": {
    "kind": "csv_manifest",
    "location": "path/to/input.csv"
  }
}
```
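A check of this metadata convention might look like the following sketch. The function name `validate_dataset_metadata` is hypothetical, and the source-kind spellings other than `csv_manifest` are assumptions derived from the prose list above (path, URL, Rucio DID, file list); PCS may name them differently.

```python
# Stage names come from the list in this section.
STAGES = {"evgen", "simu", "reco", "full", "log"}

# Only "csv_manifest" appears in the text; the other spellings are assumed.
EXTERNAL_SOURCE_KINDS = {"csv_manifest", "path", "url", "rucio_did", "file_list"}


def validate_dataset_metadata(metadata: dict) -> list:
    """Return a list of problems; an empty list means the metadata conforms."""
    problems = []
    stage = metadata.get("stage")
    if stage not in STAGES:
        problems.append(f"unknown stage: {stage!r}")
    source = metadata.get("source")
    if not isinstance(source, dict):
        problems.append("source must be an object with kind and location")
        return problems
    if source.get("kind") not in EXTERNAL_SOURCE_KINDS:
        problems.append(f"unknown source.kind: {source.get('kind')!r}")
    location = source.get("location")
    if not isinstance(location, str) or not location.strip():
        problems.append("source.location must be a non-empty string")
    return problems
```

Returning a list of problems rather than raising on the first one fits the intake style described elsewhere in this note, where missing fields and validation errors are reported back to the requester together.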

Storing the source in `Dataset.metadata` keeps the external-input capability
lightweight and transitional while
preserving the path to a later first-class workflow/stage model for EVGEN run
inside PCS/PanDA.

Production tasks should compose lists of datasets rather than a single dataset:

```text
ProdTask
  prod_config
  input_datasets[]
  output_datasets[]
  overrides
  status and submission tracking
```

The current `ProdTask.dataset` is really the output dataset. The current
`ProdTask.csv_file` is really source metadata for an external input dataset.
Both should migrate in that direction while preserving backward compatibility.

`ProdConfig` should carry workflow-mode defaults: external EVGEN input, internal
EVGEN stage execution, stage template, transformation or executable, splitting
strategy, resources, and site/queue defaults. Task-level `overrides` remain the
last-mile specialization mechanism for a reusable production config.
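The relationship between reusable `ProdConfig` defaults and task-level `overrides` can be sketched as below. This is an illustration only: the dataclasses are not the PCS Django models, the `defaults` dictionary and `effective_config` method are hypothetical, and the shallow merge is an assumed resolution rule.

```python
from dataclasses import dataclass, field


@dataclass
class ProdConfig:
    # Reusable production config; keys inside `defaults` are illustrative.
    name: str
    defaults: dict = field(default_factory=dict)


@dataclass
class ProdTask:
    prod_config: ProdConfig
    input_datasets: list = field(default_factory=list)   # dataset identifiers
    output_datasets: list = field(default_factory=list)  # dataset identifiers
    overrides: dict = field(default_factory=dict)

    def effective_config(self) -> dict:
        """Config defaults with task-level overrides applied last (shallow merge)."""
        return {**self.prod_config.defaults, **self.overrides}
```

Because the merge builds a new dictionary, the shared `ProdConfig` is never mutated by a task, which keeps the config reusable across tasks.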

## Workflow Modes

The same production intent should be expressible in two modes:

```text
external EVGEN dataset -> simulation/reconstruction -> output dataset(s)
```

and later:

```text
internal EVGEN stage -> simulation stage -> reconstruction stage -> output dataset(s)
```

The internal EVGEN case should still be one PCS production task or request,
with an internal workflow graph describing the stages. It should not require a
graph of multiple top-level PCS tasks. The model should also allow both modes
to be run and compared under the same physics, EvGen, simulation, and
reconstruction metadata.
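One way to picture a single task carrying its own stage chain is the sketch below. The mode names, the stage names, and the split of simulation and reconstruction into two stages in the external mode are all assumptions made for illustration; `plan_stages` is a hypothetical helper.

```python
# Mode and stage names are illustrative, not PCS identifiers.
WORKFLOW_STAGES = {
    "external_evgen": ["simu", "reco"],
    "internal_evgen": ["evgen", "simu", "reco"],
}


def plan_stages(mode: str) -> list:
    """Return (stage, input) pairs for one task; each stage reads the previous output."""
    stages = WORKFLOW_STAGES[mode]
    # In external mode the first stage reads the supplied EVGEN dataset;
    # in internal mode the EVGEN stage has no upstream dataset.
    previous = "external evgen dataset" if mode == "external_evgen" else None
    plan = []
    for stage in stages:
        plan.append((stage, previous))
        previous = f"{stage} output"
    return plan
```

Both modes yield a linear chain inside one task, so comparing them only requires swapping the mode while the tags stay the same.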

## Lifecycle

The public catalogue publication state is not production readiness. Sakib's
current issue fields are sufficient to create a public planning row and a
partial PCS request, but not a fully specified PCS production task.

The PCS lifecycle is the simple five-state set already on `ProdTask`:

```text
draft  →  ready  →  submitted  →  completed | failed
```
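The five-state lifecycle in the diagram above can be enforced with a small transition table, sketched here. The names `ALLOWED_TRANSITIONS`, `TransitionError`, and `set_status` are hypothetical, and the table assumes that only the forward transitions in the diagram are legal; whether PCS permits a step back (for example `ready` to `draft`) is not stated in this note.

```python
# Assumed rule set: only the forward transitions from the diagram.
ALLOWED_TRANSITIONS = {
    "draft": {"ready"},
    "ready": {"submitted"},
    "submitted": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


class TransitionError(ValueError):
    """Raised when a requested status change is not allowed."""


def set_status(current: str, new: str) -> str:
    """Return the new status if the transition is allowed, else raise."""
    if current not in ALLOWED_TRANSITIONS:
        raise TransitionError(f"unknown status: {current!r}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise TransitionError(f"transition {current} -> {new} is not allowed")
    return new
```

Keeping the rules in one table means a REST handler and an MCP tool that both call `set_status` cannot disagree about which transitions are legal.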

`draft` covers every incomplete state: missing metadata, missing tag
mapping, unvalidated inputs, partial public-catalogue projection. When
everything required is in place and the operator has confirmed, the task
transitions to `ready`. From `ready` the operator submits, which
transitions to `submitted`; PanDA then drives the terminal transitions
to `completed` or `failed`.

PCS infers likely values where possible, surfaces missing fields and
validation errors via the same surface (web / REST / MCP), and supports
operator completion through templates and defaults. The visible state
stays `draft` until readiness checks pass and the operator confirms;
there is no separate `needs_metadata`, `planned`, or
`ready_for_operator_review` state. Those conditions are sub-states *inside*
`draft` driven by validation, not enumerated transitions.

Readiness checks include path / CSV manifest validity, file readability
where possible, event counts, tag mapping, production config, and
public catalogue projection.

PCS should store the public catalogue mapping internally:

```text
public_catalog_repo = eic/epic-prod
public_catalog_issue
public_catalog_pr
public_catalog_row_index
public_catalog_csv_path = docs/_data/datasets.csv
public_catalog_row_key = Issue=<issue number>
public_catalog_page_url
public_catalog_commit_sha
```

The GitHub issue number is the durable update key. The visible row index is a
useful human locator, but is only advisory.
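The difference between the durable key and the advisory row index can be shown with a short lookup sketch. It assumes the `Issue` column of `datasets.csv` holds the issue number, possibly with a leading `#`; the actual cell format is not specified in this note, and `find_row_by_issue` is a hypothetical helper.

```python
import csv
import io


def find_row_by_issue(csv_text: str, issue: str):
    """Return (row_index, row) for the row whose Issue column matches, else None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    wanted = issue.strip().lstrip("#")
    for index, row in enumerate(reader, start=1):
        # Missing cells come back as None, so fall back to an empty string.
        cell = (row.get("Issue") or "").strip().lstrip("#")
        if cell == wanted:
            return index, row
    return None
```

The row index is recomputed on every lookup, so it stays correct when rows are inserted or reordered, while the issue number continues to identify the same request.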

## Intake Surface

All intake, whether from bots (via MCP), scripts, or the web UI, goes through
the same service layer (`pcs.services`). REST endpoints and MCP tools are
peer adapters over the same service functions; the contract (validation,
idempotency, lifecycle rules) is identical on both surfaces.

| Method | Endpoint | Purpose |
|---|---|---|
| POST   | `/pcs/api/datasets/`                          | Generic create. Body carries `metadata` (validated for external source.kind/location). |
| POST   | `/pcs/api/datasets/intake/`                   | Idempotent: given a CSV-manifest location (+ optional tag handles), find-or-create the external EVGEN Dataset, return its DID. |
| POST   | `/pcs/api/prod-tasks/`                        | Generic create. |
| POST   | `/pcs/api/prod-tasks/intake/`                 | Idempotent on a request key (e.g. `epic-prod#<issue>` or `csv_path+row_key`): create a draft ProdTask, ensure linked input Dataset(s), persist `public_catalog_*` mapping fields in `overrides`, return the task. |
| POST   | `/pcs/api/prod-tasks/<pk>/link-input/`        | Link an existing Dataset as input by DID (writes `overrides.input_dataset_did(s)`). Sugar over PATCH. |
| POST   | `/pcs/api/prod-tasks/<pk>/set-status/`        | Lifecycle transition with rule enforcement (e.g. only `ready → submitted`). |
| POST   | `/pcs/api/prod-tasks/record-submission/?name=`| Record JEDI submission outcome (`panda_task_id`, `status='submitted'`). Rejects if `panda_task_id` already set, or `status != 'ready'`. |
| GET    | `/pcs/api/prod-tasks/command/?name=&fmt=`     | Submission artifact: `condor`/`panda`/`jedi`/`dump`. |
| GET    | `/pcs/api/{datasets,prod-tasks}/`             | List with filters (`stage`, `source_kind`, `status`, `public_catalog_issue`, …). |
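A script-side call to one of the endpoints in the table above could be assembled with the Python standard library as sketched here. The base URL is a placeholder, and the body field names are an assumption that the REST body mirrors the arguments of the `pcs_prodtask_intake` MCP tool; authentication is omitted because this note does not describe it.

```python
import json
import urllib.request

# Placeholder host; substitute the real PCS server.
BASE_URL = "https://pcs.example.org"


def build_prodtask_intake_request(issue: int, input_dataset_did: str,
                                  description: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST to the prod-tasks intake endpoint."""
    body = {
        # Assumed field names, modelled on the MCP tool arguments.
        "public_catalog_issue": issue,
        "input_dataset_did": input_dataset_did,
        "description": description,
    }
    return urllib.request.Request(
        BASE_URL + "/pcs/api/prod-tasks/intake/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request is a call to `urllib.request.urlopen(request)`. Building and sending are kept separate so the payload can be inspected or logged first.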

### Idempotency keys

The two `intake/` endpoints are idempotent and require a stable key in
the request body:

- `datasets/intake/` keys on `source.location` (+ `source.kind`, default
  `csv_manifest`). Repeated calls with the same location return the same
  Dataset row.
- `prod-tasks/intake/` keys on either:
  - `public_catalog_issue` when the request originated from a GitHub
    issue, or
  - `(public_catalog_csv_path, public_catalog_row_key)` when the request
    is identified by a row in `datasets.csv`.

Repeated calls with the same key return the same ProdTask row, never
a duplicate. New input fields are merged into the existing draft (until
the task is locked or submitted).

### MCP tools

Each MCP tool is a peer to the corresponding REST endpoint, calling the
same service function. The two surfaces stay aligned because they share
business logic.

Read:

- `pcs_dataset_list(stage=None, source_kind=None, source_location=None, scope=None, name_contains=None, limit=20, offset=0)`
- `pcs_dataset_get(did=None, dataset_name=None)`
- `pcs_prodtask_list(status=None, public_catalog_issue=None, name_contains=None, limit=20, offset=0)`
- `pcs_prodtask_get(name)`
- `pcs_prodtask_artifact(name, fmt='dump')`, where `fmt` is one of `condor` / `panda` / `jedi` / `dump`

Write (intake / lifecycle):

- `pcs_dataset_intake(source_location, source_kind='csv_manifest', physics_tag=…, evgen_tag=…, simu_tag=…, reco_tag=…, detector_version=…, detector_config=…, scope='group.EIC.evgen', stage='evgen', description='', created_by=…)`: idempotent on `(source_kind, source_location)`.
- `pcs_prodtask_intake(public_catalog_issue=…, public_catalog_csv_path=…, public_catalog_row_key=…, name=…, dataset=…, prod_config=…, description=…, input_dataset_did=…, public_catalog_*=…)`: idempotent on the catalogue key.
- `pcs_prodtask_link_input(task_name, did=None, dids=None)`
- `pcs_prodtask_set_status(task_name, status)`

Submission itself (`pcs-task-cmd <name> --submit`, which calls
`pandaclient.Client.insertTaskParams()`) is **not** exposed via MCP.
The MCP server runs on swf-monitor and has no operator PanDA auth
context; the operator runs the CLI on a host where their proxy or
OIDC token is live. A future `pcs_prodtask_submit` MCP tool needs an
OIDC service account on swf-monitor first.

### Public catalogue mapping fields

Stored in `ProdTask.overrides` under reserved keys (no schema columns
needed for the interim model):

`public_catalog_repo`, `public_catalog_issue`, `public_catalog_pr`,
`public_catalog_row_index`, `public_catalog_csv_path`,
`public_catalog_row_key`, `public_catalog_page_url`,
`public_catalog_commit_sha`.

## Dynamic Public Catalogue

The present GitHub/Jekyll public catalogue page provides a static snapshot of production requests but offers no dynamic modification: cloning established requests into new campaigns, modifying requests and injecting them into campaigns, adjusting priorities, withdrawing requests, and other changes, all performed by authenticated users holding the appropriate production system rights.

PCS already has dynamic listings and extensible edit/copy/delete/update functionality in its web interface, including the task interface. Dynamic changes in PCS should still preserve traceability, audit, and review, perhaps through GitHub PRs at first and later through PCS-native audit logs and approval workflows. This dynamic interface will be developed as a candidate for adoption as the official public catalogue interface once PCS and automated production are proven and established on an ePIC-owned server.

## Roles and Approval

Once PCS is integrated with the ePIC phonebook and COmanage, role assertions will gate the dynamic catalogue. PWG members author Physics Configs within templated requirements enforced by PCS. Production managers approve those configurations before they propagate to automated production.

## Implementation Plan

1. Extend `Dataset` with stage and source metadata.
2. Add task-dataset relations for input, output, and intermediate dataset lists.
3. Migrate current `ProdTask.dataset` semantics to an output dataset relation.
4. Move current `ProdTask.csv_file` semantics to external input dataset source
   metadata, with backward compatibility during transition.
5. Add workflow-mode/template fields to `ProdConfig`.
6. Add the intake endpoints listed under "Intake Surface"
   (`datasets/intake/`, `prod-tasks/intake/`, `link-input/`, the
   lifecycle gates on `set-status/` and `record-submission/`).
   Expose each as a peer MCP tool calling the same service function,
   so PanDAbot and other MCP clients drive intake, status transitions,
   comment append, catalogue-row preview, and PCS-driven GitHub issue
   creation through the shared service layer rather than constructing
   REST queries.
7. Continue to develop the integrated automated production workflow from user input to
   running task, including a dynamic web interface providing documentation and
   flexible interaction with and control of the automated production system,
   as well as the Mattermost/bot interface.
0329    flexible interaction with and control of the automated production system,
0330    as well as the mattermost/bot interface.