swf-monitor/docs/EPICPROD_VALIDATION.md

# ePIC Production Validation

ePIC production validation connects three systems. epicprod runs automated ePIC
production through PanDA, producing simulation and reconstruction data. Hydra, the
ePIC validation application, produces validation plots from that data. argus-ai
assesses those plots and returns a natural-language judgment. This document
describes the loop and proposes the two interfaces that join the systems: the
availability signal from epicprod to Hydra, and the assessment handoff from Hydra
to argus-ai.

This should be read as a proposal, not an established design.

The assessment application is described in
[argus-ai.md](https://github.com/BNLNPPS/corun-ai/blob/master/docs/argus-ai.md).
The loop draws on the produced-data availability signal in
[EPICPROD_DATA_LINEAGE.md](EPICPROD_DATA_LINEAGE.md) and the configuration record in
[PCS.md](PCS.md).

## Components

- **epicprod** — automated ePIC production through PanDA; the source of produced
  data and of the availability signal.
- **PCS** — Physics Configuration System; the configuration and campaign record.
- **Hydra** — the ePIC validation application; produces validation plots.
- **argus-ai** — the assessment application; assesses a target and returns a
  natural-language result. See
  [argus-ai.md](https://github.com/BNLNPPS/corun-ai/blob/master/docs/argus-ai.md).

## The loop

```
PanDA completes a task/dataset
  → epicprod signals availability (catalog + event)
    → Hydra produces validation plots
      → argus-ai assesses the plots → natural-language judgment
        → delivered to Mattermost and any registered endpoint
```

## Availability (epicprod → Hydra)

PanDA determines completion: it monitors task processing and knows when a
task/dataset has finished. On completion, epicprod signals availability.

The same availability information is offered two ways:

- **Campaign-catalog JSON** — a comprehensive view of the current campaign: for
  each task/dataset, its configuration tags, campaign, request, status, and the
  produced Rucio references ([EPICPROD_DATA_LINEAGE.md](EPICPROD_DATA_LINEAGE.md))
  with file counts and completeness. A consumer reads it
  and compares against its previous read to find what is new and ready to validate.
  The catalog is described in [PCS.md](PCS.md).
- **Live event** — a per-unit notification sent the moment a unit becomes available,
  delivered over SSE to subscribers through the swf-remote streaming proxy
  ([SSE_PUSH.md](SSE_PUSH.md), [SSE_RELAY.md](SSE_RELAY.md)).

The signal is per task/dataset — the unit that completes and can be validated —
delivered as each becomes available. Completeness travels with it (file counts,
expected against actual), so a unit can be offered for validation once it reaches a
chosen threshold.

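As an illustration of the catalog route, the sketch below compares two reads of
the catalog and reports the units that have newly reached a completeness
threshold. It is a minimal sketch under stated assumptions, not the epicprod
schema: the field names (`units`, `name`, `files_expected`, `files_actual`) and
the threshold value are hypothetical, and the actual catalog layout is the one
described in [PCS.md](PCS.md).

```python
# Hypothetical catalog layout for illustration only; the real schema is the one
# described in PCS.md. Assumed fields: "units", "name", "files_expected",
# "files_actual".
THRESHOLD = 0.95  # assumed completeness threshold chosen by the consumer


def completeness(unit):
    """Fraction of expected files actually produced (0.0 if none are expected)."""
    expected = unit.get("files_expected", 0)
    if expected <= 0:
        return 0.0
    return min(unit.get("files_actual", 0) / expected, 1.0)


def ready_names(catalog, threshold=THRESHOLD):
    """Names of units whose completeness has reached the threshold."""
    return {
        unit["name"]
        for unit in catalog.get("units", [])
        if completeness(unit) >= threshold
    }


def newly_ready(previous, current, threshold=THRESHOLD):
    """Units ready in the current read that were not ready in the previous one."""
    return sorted(ready_names(current, threshold) - ready_names(previous, threshold))


previous_read = {"units": [
    {"name": "task-a", "files_expected": 100, "files_actual": 100},
    {"name": "task-b", "files_expected": 200, "files_actual": 120},
]}
current_read = {"units": [
    {"name": "task-a", "files_expected": 100, "files_actual": 100},
    {"name": "task-b", "files_expected": 200, "files_actual": 196},
    {"name": "task-c", "files_expected": 50, "files_actual": 10},
]}

print(newly_ready(previous_read, current_read))  # ['task-b']
```

Here `task-a` was already ready in the previous read, `task-c` is still below
the threshold, and only `task-b` is reported as new and ready to validate.
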
## Hydra

Hydra takes the availability information and the produced-data references and
returns validation plots.

## Assessment (Hydra → argus-ai)

When a validation is available, Hydra would notify argus-ai, proposing an
assessment of that task/dataset. Whether an assessment then runs automatically is a
per-source, per-target setting, so the assessment rate stays under operator
control. The assessment itself — its inputs, execution, and history and benchmark
comparison — is described in
[argus-ai.md](https://github.com/BNLNPPS/corun-ai/blob/master/docs/argus-ai.md).

An assessment can also be triggered by user request via PanDAbot, which passes the
request to argus-ai through corun-ai's MCP service and returns a natural-language
assessment of the validation.

One assessment can cover a single task/dataset or a group of them — a request or a
benchmark — independent of the per-unit availability signal.

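The per-source, per-target setting described in this section could be pictured
as a lookup keyed by the (source, target) pair. The sketch below is only an
illustration: the table, the source and target names, and the choice to default
to "off" for unlisted pairs are assumptions, not part of argus-ai.

```python
# Hypothetical policy table; the source and target names are illustrative only.
AUTO_ASSESS = {
    ("hydra", "reco-validation"): True,
    ("hydra", "sim-validation"): False,
}


def should_auto_assess(source, target, policy=AUTO_ASSESS):
    """Run automatically only when the (source, target) pair is explicitly enabled.

    Unlisted pairs default to False (an assumption of this sketch), so a new
    source or target never triggers assessments until an operator enables it.
    """
    return policy.get((source, target), False)


print(should_auto_assess("hydra", "reco-validation"))  # True
print(should_auto_assess("hydra", "new-target"))       # False
```
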
## Delivery

When an assessment completes, argus-ai delivers the result to the destinations
registered for that request: Mattermost via PanDAbot, and any registered REST
endpoints. The requestor is recorded.

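One way to picture delivery is a request record that carries its requestor and
registered destinations, with the result fanned out to each. This is a sketch
under assumptions: `AssessmentRequest`, `deliver`, and the destination names are
hypothetical, and the senders are in-memory stand-ins rather than real
Mattermost or REST calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssessmentRequest:
    """Hypothetical request record; field names are assumptions of this sketch."""
    unit: str                     # task/dataset the assessment covers
    requestor: str                # recorded with the request
    destinations: List[str] = field(default_factory=list)


def deliver(request, result, senders):
    """Send the result to each registered destination; return the names delivered to."""
    delivered = []
    for name in request.destinations:
        send = senders.get(name)
        if send is None:
            continue  # no sender known for this destination: skipped in this sketch
        send(request, result)
        delivered.append(name)
    return delivered


# In-memory stand-ins for the real delivery channels.
outbox = []
senders = {
    "mattermost": lambda req, res: outbox.append(("mattermost", req.unit, res)),
    "rest": lambda req, res: outbox.append(("rest", req.unit, res)),
}

request = AssessmentRequest("task-a", "alice", ["mattermost", "rest"])
print(deliver(request, "plots consistent with the benchmark", senders))
# ['mattermost', 'rest']
```
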
## Validation track

Validation and its assessment are a first-class part of the production workflow,
visible across the loop and recorded against the task/dataset.

## Related

- [PCS.md](PCS.md) — the configuration and campaign record.
- [EPICPROD_DATA_LINEAGE.md](EPICPROD_DATA_LINEAGE.md) — produced-dataset Rucio references; the availability signal draws on these.
- [SSE_PUSH.md](SSE_PUSH.md), [SSE_RELAY.md](SSE_RELAY.md) — the notification mechanism the live event uses.
- [argus-ai.md](https://github.com/BNLNPPS/corun-ai/blob/master/docs/argus-ai.md) — the assessment application.