# Architecture Overview

System design and components of the SWF Testbed.

## Overview

The testbed plan is based on ePIC streaming computing model WG discussions
in the streaming computing model meeting[^1], guided by the ePIC streaming
computing model report[^2], and the ePIC workflow management system
requirements draft[^3].

## Testbed plan

The testbed prototypes the ePIC streaming computing model's workflows and
dataflows from Echelon 0 (E0) egress (the DAQ exit buffer)
through the processing that takes place at the two Echelon 1 computing
facilities at BNL and JLab.

The testbed scope, timeline and workplan are described in a planning
document[^4]. Detailed progress tracking and development discussion are in a
progress document[^5].

See the E0-E1 overview slide deck[^6] for more information on the E0-E1
workflow and dataflow.
The following is a schematic of the system the testbed targets (from the blue
DAQ external subnet rightwards).

*Figure: E0-E1 data flow and processing schematic*

## Design and implementation

Overall system design and implementation notes:

- We aim to follow the [Software Statement of Principles](https://eic.github.io/activities/principles.html) of the EIC and ePIC in the design and
  implementation of the testbed software.
- The implementation language is Python 3.9 or greater.
- Testbed modules are implemented as a set of loosely coupled agents, each
  with a specific role in the system.
- The agents communicate via messaging, using ActiveMQ as the message
  broker.
- The PanDA[^7] distributed workload management system and its ancillary
  components are used for workflow orchestration and workload execution.
- The Rucio[^8] distributed data management system is used for
  management and distribution of data and associated metadata, in close
  orchestration with PanDA.
- High-quality monitoring and centralized management of system data (metadata,
  bookkeeping, logs, etc.) is a primary design goal. Monitoring and system
  data gathering and distribution are implemented via a web service backed
  by a relational database, with a REST API for data access and reporting.
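As a minimal sketch of the messaging pattern above, agents can exchange JSON payloads over the ActiveMQ broker. The field names (`msg_type`, `namespace`) and message types here are illustrative assumptions, not the testbed's actual wire schema:

```python
import json

def build_message(msg_type: str, namespace: str, **payload) -> str:
    """Serialize an agent message as JSON for the ActiveMQ broker.

    The field names here are illustrative, not the testbed's actual schema.
    """
    message = {"msg_type": msg_type, "namespace": namespace, **payload}
    return json.dumps(message)

# Example: a hypothetical STF-available notification.
body = build_message("stf_gen", "epic-dev", filename="stf_000123.dat", run_id=42)
decoded = json.loads(body)
print(decoded["msg_type"])  # stf_gen
```

A real agent would publish such a body to a broker topic via a STOMP client and subscribe to the topics it needs to react to.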

## Software organization

The streaming workflow (swf-prefixed) set of repositories makes up the software for the ePIC
streaming workflow testbed project, with development begun in June 2025.
This swf-testbed repository serves as the umbrella repository for the testbed.
It's the central place for documentation, overall configuration,
and high-level project information.

The repositories mapping to testbed components are:

### [swf-monitor](https://github.com/BNLNPPS/swf-monitor)

A web service providing system monitoring and comprehensive information about the testbed's state, both via browser-based dashboards and a JSON-based REST API.

This module manages the databases used by the testbed, and offers a REST API for other agents in the system to report status and retrieve information. It acts as a listener for the ActiveMQ message broker, receiving messages from other agents, storing relevant data in the database, and presenting message histories in the monitor. It hosts a Model Context Protocol (MCP) server for the agents to share information with LLM clients, creating an intelligent assistant for the testbed.

### [swf-daqsim-agent](https://github.com/BNLNPPS/swf-daqsim-agent)

This is the information agent designed to simulate the Data Acquisition (DAQ)
system and other EIC machine and ePIC detector influences on streaming
processing. This dynamic simulator acts as the primary input and driver of
activity within the testbed.

### [swf-data-agent](https://github.com/BNLNPPS/swf-data-agent)

This is the central data handling agent within the testbed. It listens to
the swf-daqsim-agent, manages Rucio subscriptions of run datasets and STF
files, creates new run datasets, and sends messages to the
swf-processing-agent for run processing and to the swf-fastmon-agent for new
STF availability. It will also have a 'watcher' role to identify and report
stalls or anomalies.

### [swf-processing-agent](https://github.com/BNLNPPS/swf-processing-agent)

This is the prompt processing agent that configures and submits PanDA
processing jobs to execute the streaming workflows of the testbed.

### [swf-fastmon-agent](https://github.com/BNLNPPS/swf-fastmon-agent)

This is the fast monitoring agent designed to consume (fractions of) STF data
for quick, near real-time monitoring. This agent will reside at the E1s and perform
remote data reads from STF files in the DAQ exit buffer, skimming a fraction
of the data of interest for fast monitoring. The agent will be notified of new
STF availability by the swf-data-agent.

### [swf-mcp-agent](https://github.com/BNLNPPS/swf-mcp-agent)

This agent may be added in the future for managing Model Context Protocol
(MCP) services. For the moment, this is done in swf-monitor (colocated with
the agent data the MCP services provide).

Note Paul Nilsson's [ask-panda example](https://github.com/PalNilsson/ask-panda) of an
MCP server and client; we want to integrate it into the testbed. Tadashi Maeno has also implemented MCP capability on the core PanDA services; we will want to integrate that as well.

## Testbed System Architecture

### Deployment Modes

The testbed supports two deployment modes:

**Development Mode** (Docker-managed infrastructure):
- PostgreSQL and ActiveMQ run as Docker containers
- Managed by `docker-compose.yml`
- Started/stopped via `swf-testbed start/stop`
- Ideal for development and testing environments

**System Mode** (System-managed infrastructure):
- PostgreSQL and ActiveMQ run as system services (e.g., `postgresql-16.service`, `artemis.service`)
- Managed by the system service manager (systemd)
- Testbed manages only agent processes via `swf-testbed start-local/stop-local`
- Typical for shared development systems like pandaserver02

Both modes use supervisord to manage Python agent processes. Use `python report_system_status.py` to verify service availability and determine which mode is active.
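A service-availability check along these lines can tell the two modes apart. This is a sketch, not the actual `report_system_status.py`; the default ports (PostgreSQL 5432, ActiveMQ/Artemis STOMP 61613) are assumptions:

```python
import socket

def service_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Assumed default ports: PostgreSQL 5432, ActiveMQ/Artemis STOMP 61613.
for name, port in [("PostgreSQL", 5432), ("ActiveMQ", 61613)]:
    state = "up" if service_listening("localhost", port) else "down"
    print(f"{name}: {state}")
```

Whether the listening process is a Docker container or a systemd unit is then a matter of how it was started, not how it is reached.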

### Database Schema

The database schema for the monitoring system is automatically maintained in the swf-monitor repository. View the current schema:

- [testbed-schema.dbml](https://github.com/BNLNPPS/swf-monitor/blob/main/testbed-schema.dbml)

To generate the schema manually:

```bash
cd swf-monitor/src
python manage.py dbml > ../testbed-schema.dbml
```

To visualize the schema, paste the DBML content into [dbdiagram.io](https://dbdiagram.io/).

### Agent Identity Management

The testbed uses a global sequential agent ID system to ensure unique agent identification:

**Agent Naming Convention:**
- Format: `{agent_type}-agent-{username}-{sequential_id}`
- Examples: `data-agent-wenauseic-1`, `processing-agent-wenauseic-2`, `daq-simulator-wenauseic-3`

**Sequential ID Assignment:**
- Global counter shared across all agent types
- Stored in the PersistentState database model
- Thread-safe atomic assignment via `SELECT FOR UPDATE`
- API endpoint: `/api/state/next-agent-id/`

**Benefits:**
- Guaranteed unique agent names across the system
- Easy tracking of agent generations over time
- No collision risk even with concurrent agent startups
- Human-readable sequential numbering (1, 2, 3...)

This replaces the previous random suffix approach and ensures long-term uniqueness as the system scales.
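The atomic-counter idea can be sketched as follows. This uses SQLite's exclusive transactions to stand in for the `SELECT FOR UPDATE` that swf-monitor performs in PostgreSQL; the table name `persistent_state` and the function names are hypothetical:

```python
import sqlite3

def next_agent_id(conn: sqlite3.Connection) -> int:
    """Atomically increment and return the global agent counter.

    SQLite lacks SELECT FOR UPDATE; taking the write lock up front with
    BEGIN IMMEDIATE gives the same read-increment-write atomicity.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        (value,) = conn.execute(
            "SELECT value FROM persistent_state WHERE key = 'agent_id'"
        ).fetchone()
        conn.execute(
            "UPDATE persistent_state SET value = ? WHERE key = 'agent_id'",
            (value + 1,),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return value + 1

def agent_name(agent_type: str, username: str, seq_id: int) -> str:
    return f"{agent_type}-agent-{username}-{seq_id}"

# In-memory stand-in for the monitor's PersistentState model.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE persistent_state (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO persistent_state VALUES ('agent_id', 0)")
print(agent_name("data", "wenauseic", next_agent_id(conn)))  # data-agent-wenauseic-1
```

In the testbed, agents obtain the ID over the REST endpoint rather than touching the database directly, so the locking stays on the server side.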

### Namespace Isolation

Namespaces provide logical isolation for workflows and agents:

- **Purpose**: Allow users to distinguish their workflows from those of other users
- **Scope**: All workflow messages include the namespace for filtering
- **Configuration**: Set in `workflows/testbed.toml` before running workflows
- **Collaboration**: Multiple users can share a namespace for collaborative work
- **UI Integration**: The monitor UI supports namespace filtering on agents, executions, and messages

Example namespace configuration:

```toml
[testbed]
namespace = "epic-fastmon-dev"
```
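Since every workflow message carries the namespace, consumers can filter on it with a simple predicate. The message dictionaries below are illustrative, not the testbed's actual schema:

```python
def in_namespace(message: dict, namespace: str) -> bool:
    """Keep only messages tagged with the given namespace."""
    return message.get("namespace") == namespace

# Hypothetical message stream mixing two users' namespaces.
messages = [
    {"msg_type": "run_start", "namespace": "epic-fastmon-dev"},
    {"msg_type": "run_start", "namespace": "someone-else"},
]
mine = [m for m in messages if in_namespace(m, "epic-fastmon-dev")]
print(len(mine))  # 1
```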

### Workflow Definition Architecture

The testbed implements a three-layer workflow architecture to support both complex orchestration and agile parameter experimentation:

**Layer 1: Snakemake (Complex Workflows)**
- For complex multi-facility workflows with dependencies
- Handles orchestration-heavy scenarios (e.g., calibration workflows)
- Integrates with PanDA for distributed execution
- Used when workflow logic involves conditional execution and complex dependencies

**Layer 2: TOML Configuration (Parameter Management)**
- Human-readable parameter definitions for workflow variants
- Supports systematic experimentation and comparison
- Version control friendly for tracking parameter evolution
- Enables rapid iteration on workflow configurations

**Layer 3: Python + SimPy (Execution Engine)**
- Direct SimPy integration for simulation and execution
- Native Python expressiveness for workflow logic
- Real-time parameter-driven execution
- Optimized for fast processing workflows requiring rapid experimentation

**Workflow Type Classification:**
- **Fast Processing Workflows**: Use TOML + Python+SimPy (Layers 2 & 3) for rapid experimentation with worker counts, processing targets, and sampling rates
- **Complex Orchestration Workflows**: Use all three layers when sophisticated dependency management and multi-facility coordination are required

This architecture maintains testbed agility while supporting both simple parameter experiments and complex production-like workflows.

The following diagram shows the testbed's agent-based architecture and data flows:

```mermaid
graph LR
    DAQ[E1/E2 Fast Monitor]
    PanDA[PanDA]
    Rucio[Rucio]
    ActiveMQ[ActiveMQ]
    PostgreSQL[PostgreSQL]
    DAQSim[swf-daqsim-agent]
    DataAgent[swf-data-agent]
    ProcAgent[swf-processing-agent]
    FastMon[swf-fastmon-agent]
    Monitor[swf-monitor]
    WebUI[Web Dashboard]
    RestAPI[REST API]
    MCP[MCP Server]

    DAQSim -->|1| ActiveMQ
    ActiveMQ -->|1| DataAgent
    ActiveMQ -->|1| ProcAgent
    DataAgent -->|2| Rucio
    ProcAgent -->|3| PanDA

    DAQSim -->|4| ActiveMQ
    ActiveMQ -->|4| DataAgent
    ActiveMQ -->|4| ProcAgent
    ActiveMQ -->|4| FastMon

    DataAgent -->|5| Rucio
    ProcAgent -->|6| PanDA
    FastMon -.->|7| DAQ

    ActiveMQ -.-> Monitor
    Monitor --> PostgreSQL
    Monitor --> WebUI
    Monitor --> RestAPI
    Monitor --> MCP
```

*Figure: Testbed agent architecture and data flow diagram*

**Workflow Steps:**

1. **Run Start** - daqsim-agent generates a run start broadcast message indicating a new data-taking run is beginning
2. **Dataset Creation** - data-agent sees the run start message and has Rucio create a dataset for the run
3. **Processing Task** - processing-agent sees the run start message and establishes a PanDA processing task for the run
4. **STF Available** - daqsim-agent generates a broadcast message that a new STF data file is available
5. **STF Transfer** - data-agent sees the message and initiates Rucio registration and transfer of the STF file to E1 facilities
6. **STF Processing** - processing-agent sees that the new STF file has been added to the dataset and transferred to the E1 by Rucio, and initiates a PanDA job to process the STF
7. **Fast Monitoring** - fastmon-agent sees the broadcast message that a new STF data file is available and performs a partial read to inject a data sample into E1/E2 fast monitoring
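The steps above amount to agents dispatching on broadcast message types. A minimal sketch of that pattern, with handler names and message fields that are illustrative rather than the testbed's actual code:

```python
# Hypothetical handlers keyed by message type; real agents subscribe to
# ActiveMQ topics and react similarly, but these names are illustrative.
def on_run_start(msg: dict) -> str:
    # Steps 2-3: create the run dataset and establish the PanDA task.
    return f"create dataset and PanDA task for run {msg['run_id']}"

def on_stf_available(msg: dict) -> str:
    # Steps 5-7: register/transfer the STF, process it, sample it.
    return f"register, transfer, process and sample {msg['filename']}"

HANDLERS = {"run_start": on_run_start, "stf_gen": on_stf_available}

def dispatch(msg: dict) -> str:
    return HANDLERS[msg["msg_type"]](msg)

print(dispatch({"msg_type": "run_start", "run_id": 7}))
print(dispatch({"msg_type": "stf_gen", "filename": "stf_000001.dat"}))
```

Each real agent registers handlers only for the message types relevant to its role, which is what keeps the agents loosely coupled.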

## References

[^1]: [ePIC streaming computing model meeting page](https://docs.google.com/document/d/1t5vBfgro8Kb6MKc-bz2Y67u3cOCpHK4dfepbJX-nEbE/edit?tab=t.0#heading=h.y3evqgz3sc98)

[^2]: [ePIC streaming computing model report](https://zenodo.org/records/14675920)

[^3]: [ePIC workflow management system requirements draft](https://docs.google.com/document/d/1OmAGzFgZgEP6ntuRkP51kiOqF_0uh_RPjq8wgdTwb2A/edit?tab=t.0#heading=h.g1vlz8vqp7ht)

[^4]: [ePIC streaming workflow testbed planning document](https://docs.google.com/document/d/1mPqMsjHiymkeAB7uih_8TjFIluwM8MENIWZF3EDwNrU/edit?tab=t.0)

[^5]: [ePIC streaming workflow testbed progress document](https://docs.google.com/document/d/1PUoo-W6dCeOKsD4VubYTgSxBHBUb4D5dYfVy1oLYh7E/edit?tab=t.0#heading=h.qovfena71s)

[^6]: [E0-E1 overview slide deck](https://docs.google.com/presentation/d/1Vbt68LwBDA-eDghlWWg8278ys_K0axbX/edit?slide=id.g2fdc8697d63_0_18#slide=id.g2fdc8697d63_0_18)

[^7]: [PanDA: Production and Distributed Analysis System](https://link.springer.com/article/10.1007/s41781-024-00114-3);
    [PanDA documentation](https://panda-wms.readthedocs.io/en/latest/index.html);
    [BNL PanDA startup guide](https://docs.google.com/document/d/1zxtpDb44yNmd3qMW6CS7bXCtqZk-li2gPwIwnBfMNNI/edit?tab=t.0#heading=h.iiqfpuwcgs2k)

[^8]: [Rucio: A Distributed Data Management System](https://link.springer.com/article/10.1007/s41781-019-0026-3)