# MCP Stabilization Status

Date: 2026-04-27

Related plan: [MCP_STABILIZATION_PLAN.md](MCP_STABILIZATION_PLAN.md)

## Summary

The first stabilization pass has been implemented, committed, pushed, deployed,
and verified on swf-testbed.

Commit:

```text
ed87bd8 Stabilize local MCP service operation
```

The live PanDA bot and testbed bot now initialize MCP successfully through the
local loopback ASGI endpoint.

## Current Operating Model

MCP on swf-testbed is operated as local stateless POST request/response MCP:

- Local endpoint: `http://127.0.0.1:8001/swf-monitor/mcp/`
- Public Apache path remains present for routing, but long-lived GET/SSE
  streaming is not an operational dependency.
- `DJANGO_MCP_GLOBAL_SERVER_CONFIG["stateless"] = True`
- The ASGI service remains isolated from the main mod_wsgi Django site.

The useful supported surface is:

- `initialize`
- `tools/list`
- `tools/call`
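
To make this surface concrete, the sketch below builds the three JSON-RPC 2.0 request bodies that a client would POST to the local endpoint. It is an illustration only, not code from the bots: the helper names `build_request` and `post`, the `clientInfo` values, the protocol version string, the `Accept` header value, and the tool name `example_tool` are all assumptions.

```python
import json
import urllib.request

MCP_URL = "http://127.0.0.1:8001/swf-monitor/mcp/"  # local endpoint from this document

# Headers typically sent for MCP over HTTP POST (assumed, not verified here).
HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}


def build_request(request_id, method, params=None):
    """Return a JSON-RPC 2.0 request body as a dict (hypothetical helper)."""
    body = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        body["params"] = params
    return body


def post(body, timeout=10):
    """POST one request and return the raw response text (hypothetical helper)."""
    request = urllib.request.Request(
        MCP_URL, data=json.dumps(body).encode("utf-8"), headers=HEADERS, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


initialize = build_request(1, "initialize", {
    "protocolVersion": "2025-03-26",  # assumed; use the version your client supports
    "capabilities": {},
    "clientInfo": {"name": "example-client", "version": "0.0.1"},  # placeholder
})
tools_list = build_request(2, "tools/list")
tools_call = build_request(3, "tools/call", {
    "name": "example_tool",  # hypothetical tool name
    "arguments": {},
})

# Print the bodies only; call post(...) against a running service to send them.
for body in (initialize, tools_list, tools_call):
    print(json.dumps(body))
```

Because the server runs in stateless mode, each POST is a self-contained exchange and the client does not need to hold a long-lived stream open.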

## Implemented Changes

1. MCP stateless mode enabled.

   `src/swf_monitor_project/settings.py` now sets:

   ```python
   "stateless": True
   ```

2. ASGI worker bounded.

   `swf-monitor-mcp-asgi.service` now runs uvicorn with:

   ```text
   --workers 4
   --limit-concurrency 32
   --timeout-graceful-shutdown 15
   ```

   `TimeoutStopSec=30` is also set at the systemd layer.

3. Apache MCP proxy simplified.

   `apache-swf-monitor.conf` now treats MCP as bounded request/response
   traffic:

   - proxy timeout reduced from 3600s to 60s
   - streaming-specific `proxy-sendchunked` and `no-gzip` settings removed

4. Bots moved to local MCP by default.

   PanDA bot and testbed bot default to:

   ```text
   http://127.0.0.1:8001/swf-monitor/mcp/
   ```

5. Bot exception paths fixed.

   PanDA bot initializes tool metadata before MCP calls can fail, preventing
   the observed secondary `UnboundLocalError`. PanDA and testbed bots now log
   unexpected response-task exceptions and return a visible Mattermost error
   instead of failing silently.

6. Async DB logging fixed.

   `DbLogHandler` no longer writes Django ORM records directly from the caller
   context. It now queues log payloads and writes them from a background thread,
   avoiding Django async-context ORM errors.

7. MCP health endpoint added.

   New endpoint:

   ```text
   /swf-monitor/api/mcp-health/
   ```

   It verifies that Django can serve a request and reach the default database
   without invoking MCP transport/session code.

8. MCP watchdog added.

   New files:

   - `scripts/mcp_watchdog.py`
   - `swf-monitor-mcp-watchdog.service`
   - `swf-monitor-mcp-watchdog.timer`

   The watchdog checks:

   - MCP health endpoint
   - MCP `initialize`
   - MCP `tools/list`

   When run by the timer with `--restart`, it restarts
   `swf-monitor-mcp-asgi.service` after a failed probe.

9. Documentation updated.

   `docs/MCP.md` and `docs/PRODUCTION_DEPLOYMENT.md` now describe the local
   stateless request/response operating model and the watchdog.
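
The queue-and-background-thread pattern described for `DbLogHandler` in item 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's handler: the class name `QueuedLogHandler`, the payload fields, the drop-on-full policy, and the `write` callable (standing in for the ORM write) are all hypothetical.

```python
import logging
import queue
import threading


class QueuedLogHandler(logging.Handler):
    """Queue log payloads and write them from a background thread.

    `write` stands in for the database write (hypothetical); the real
    handler's model and fields are not shown in this document.
    """

    def __init__(self, write, maxsize=1000):
        super().__init__()
        self._write = write
        self._queue = queue.Queue(maxsize=maxsize)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, record):
        # Build a plain dict in the caller's context; no database access here.
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        try:
            self._queue.put_nowait(payload)
        except queue.Full:
            pass  # drop rather than block the caller (a choice made for this sketch)

    def _drain(self):
        # Runs in the worker thread, outside any async event loop.
        while True:
            payload = self._queue.get()
            try:
                self._write(payload)
            except Exception:
                pass  # a failed write must not kill the worker
            finally:
                self._queue.task_done()

    def flush(self):
        # Block until every queued payload has been handed to `write`.
        self._queue.join()


# Usage: a list's append method plays the role of the database write.
written = []
handler = QueuedLogHandler(write=written.append)
log = logging.getLogger("mcp.sketch")
log.addHandler(handler)
log.warning("probe failed")
handler.flush()
print(written)
```

The point of the design is that `emit` only touches an in-memory queue, so it is safe to call from async request handlers, while the blocking write happens on an ordinary thread. The standard library's `logging.handlers.QueueHandler` and `QueueListener` provide a similar split.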

## Deployment Performed

Deployment command used:

```bash
sudo /opt/swf-monitor/bin/deploy-swf-monitor.sh branch infra/baseline-v35
```

Deployment result:

- release: `branch-infra-baseline-v35`
- deployed commit: `ed87bd8`
- Apache health check: passed
- Apache configuration synced from the canonical copy in the repository
- ASGI worker restarted by deploy script
- PanDA bot restarted by deploy script
- testbed bot restarted by deploy script

Additional manual systemd unit sync was required because the deploy script
does not install service unit definitions:

```bash
sudo install -o root -g root -m 644 /opt/swf-monitor/current/swf-monitor-mcp-asgi.service /etc/systemd/system/swf-monitor-mcp-asgi.service
sudo install -o root -g root -m 644 /opt/swf-monitor/current/swf-monitor-mcp-watchdog.service /etc/systemd/system/swf-monitor-mcp-watchdog.service
sudo install -o root -g root -m 644 /opt/swf-monitor/current/swf-monitor-mcp-watchdog.timer /etc/systemd/system/swf-monitor-mcp-watchdog.timer
sudo systemctl daemon-reload
sudo systemctl restart swf-monitor-mcp-asgi.service
sudo systemctl enable --now swf-monitor-mcp-watchdog.timer
```

## Live Verification

Verified after deployment:

- `swf-monitor-mcp-asgi.service`: active
- `swf-panda-bot.service`: active
- `swf-testbed-bot.service`: active
- `swf-monitor-mcp-watchdog.timer`: active
- MCP health endpoint returned:

  ```json
  {
    "ok": true,
    "service": "swf-monitor-mcp-asgi",
    "database": "ok",
    "mcp_stateless": true
  }
  ```

- Watchdog direct probe returned:

  ```text
  MCP watchdog OK: 45 tools
  ```

- PanDA bot journal showed:

  ```text
  Listening on #pandabot ... MCP: http://127.0.0.1:8001/swf-monitor/mcp/
  HTTP MCP: 13 tools
  ```

- Testbed bot journal showed:

  ```text
  Listening on #swf-testbed-bot + DMs (MCP: http://127.0.0.1:8001/swf-monitor/mcp/)
  Discovered 44 tools via MCP
  ```

The user confirmed that the bot MCP path works after deployment.
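
For reference, a health payload with the shape returned above could be assembled along these lines. This is a framework-neutral sketch, not the project's view: the function name `mcp_health`, the `check_database` callable, and the format of the error string are assumptions.

```python
def mcp_health(check_database, stateless):
    """Build a health payload shaped like the response shown above.

    `check_database` is a hypothetical zero-argument callable that raises on
    failure, for example one that runs `SELECT 1` on the default connection.
    """
    try:
        check_database()
        database = "ok"
    except Exception as exc:
        # The error format is an assumption; the real endpoint may differ.
        database = f"error: {type(exc).__name__}"
    return {
        "ok": database == "ok",
        "service": "swf-monitor-mcp-asgi",
        "database": database,
        "mcp_stateless": bool(stateless),
    }


print(mcp_health(lambda: None, stateless=True))
```

Keeping the check to a request plus one database round trip means the endpoint can still answer when MCP transport or session code is the part that is stuck.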

## Notes And Caveats

The current `django-mcp-server` adapter still creates and shuts down a
`StreamableHTTPSessionManager` per request internally. Stateless mode removes
server-side MCP session dependence and keeps requests short, but it is not a
full replacement for a correct lifespan-managed ASGI MCP implementation.

One watchdog run failed during the exact ASGI restart window and restarted the
ASGI service. A subsequent manual watchdog run succeeded. This is expected for
the initial enable/restart sequence.
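
The probe-then-restart flow behind that run can be sketched as follows. It is a minimal illustration, not the contents of `scripts/mcp_watchdog.py`: the function name `run_watchdog`, the probe callables, the printed messages, and the exit codes are assumptions. Only the service name and the three probe targets come from this document.

```python
import subprocess

SERVICE = "swf-monitor-mcp-asgi.service"  # unit name from this document


def run_watchdog(probes, restart=False, restart_service=None):
    """Run (name, probe) pairs in order; return 0 if all pass, 1 otherwise.

    Each probe is a hypothetical zero-argument callable that raises on failure.
    """
    if restart_service is None:
        def restart_service():
            subprocess.run(["systemctl", "restart", SERVICE], check=True)

    for name, probe in probes:
        try:
            probe()
        except Exception as exc:
            print(f"probe {name} failed: {exc}")
            if restart:
                restart_service()  # restart once, then stop probing
            return 1
    print("all probes passed")
    return 0


def refused():
    # Simulates the ASGI service being down during a restart window.
    raise ConnectionError("connection refused")


# Usage with stand-in probes and a recorded restart instead of systemctl.
restarts = []
status = run_watchdog(
    [("health", lambda: None), ("initialize", refused), ("tools/list", lambda: None)],
    restart=True,
    restart_service=lambda: restarts.append(SERVICE),
)
print(status, restarts)
```

A probe that lands inside a restart window fails in exactly this way, which is why a single failure right after enabling the timer is not alarming on its own.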
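
The PanDA bot performs model/cache initialization on startup and may show high
CPU briefly after restart. This is separate from the MCP ASGI worker lockup
problem.

## Remaining Follow-Up

1. Monitor journals and process CPU after sustained use.

   Watch:

   ```bash
   sudo journalctl -u swf-monitor-mcp-asgi.service -f
   sudo journalctl -u swf-monitor-mcp-watchdog.service -f
   sudo journalctl -u swf-panda-bot.service -f
   ```

2. Consider updating the deploy script to install changed systemd unit files
   automatically, with validation before reload.

3. Consider optimizing deployment venv handling. The current deploy script
   copies the virtual environment every deploy because releases are
   self-contained; in principle the venv only needs to change when dependency
   inputs change.

4. If remote MCP or GET/SSE streaming becomes a real requirement, implement it
   as a dedicated ASGI app with an application-lifetime
   `StreamableHTTPSessionManager` and load-test it before advertising support.

5. Add request timing metrics around MCP calls if the endpoint still shows
   unexplained stalls under normal local bot usage.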
0221 
0222 3. Consider optimizing deployment venv handling. The current deploy script
0223    copies the virtual environment every deploy because releases are
0224    self-contained; in principle the venv only needs to change when dependency
0225    inputs change.
0226 
0227 4. If remote MCP or GET/SSE streaming becomes a real requirement, implement it
0228    as a dedicated ASGI app with an application-lifetime
0229    `StreamableHTTPSessionManager` and load-test it before advertising support.
0230 
0231 5. Add request timing metrics around MCP calls if the endpoint still shows
0232    unexplained stalls under normal local bot usage.