# MCP Stabilization Plan

Date: 2026-04-27

## Context

The swf-monitor MCP endpoint has repeatedly locked up on swf-testbed. The
current production service runs uvicorn with 20 worker processes for
`/swf-monitor/mcp/`, fronted by Apache. Increasing the worker count did not fix
the failure mode: on 2026-04-27 the service became nonresponsive, Apache
returned MCP backend 502s, the PanDA bot timed out waiting for MCP
`initialize`, and systemd had to SIGKILL all uvicorn workers after graceful
shutdown failed.

This is not primarily a concurrency shortage. It is an MCP transport and
lifecycle problem.

## Appraisal

The installed `django-mcp-server` adapter creates, runs, and shuts down a fresh
`StreamableHTTPSessionManager` per Django request via `async_to_sync`. The MCP
SDK documents `StreamableHTTPSessionManager` as an application-lifetime object
that should be created once and run once. The current adapter may work for short
JSON POST requests, but it is a poor fit for long-lived streamable HTTP/SSE
sessions.

The deployment also claims streaming HTTP support, but GET/SSE behavior is not
healthy in practice. Direct GET requests to `/swf-monitor/mcp/` return 400 or
406 depending on headers, and logs show clients attempting GET and receiving
406. The system is paying the operational complexity cost of streamable HTTP
without delivering robust streaming semantics.

On swf-testbed, swf-monitor MCP is used locally. There is no current operational
requirement for remote MCP clients to maintain long-lived streaming sessions.
The useful surface is request/response tool invocation: `initialize`,
`tools/list`, and `tools/call`.
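
As a rough illustration of that request/response surface, the sketch below builds the three JSON-RPC 2.0 bodies a POST-only client would send. The helper name `jsonrpc_request`, the client name, the protocol version string, and the tool name `swf_example_tool` are assumptions for illustration and are not taken from the swf-monitor code.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any unique value works


def jsonrpc_request(method, params=None):
    """Build one JSON-RPC 2.0 request body as a dict."""
    body = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        body["params"] = params
    return body


# Headers a POST-only client would typically send. The MCP streamable HTTP
# transport asks clients to accept both JSON and SSE responses on POST.
HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}

initialize = jsonrpc_request("initialize", {
    "protocolVersion": "2025-03-26",  # assumption: use what the server supports
    "capabilities": {},
    "clientInfo": {"name": "example-bot", "version": "0.1"},  # hypothetical
})
tools_list = jsonrpc_request("tools/list")
tools_call = jsonrpc_request("tools/call", {
    "name": "swf_example_tool",  # hypothetical tool name
    "arguments": {},
})

for body in (initialize, tools_list, tools_call):
    print(json.dumps(body))
```

Each body would be sent as its own POST to the MCP endpoint. None of the three calls needs a GET stream, which is why a stateless deployment can still serve them.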

## Streaming Clarification

The proposed stabilization does not replace MCP with arbitrary REST and does
not break the MCP JSON-RPC tool surface. The endpoint should remain MCP over
HTTP for clients that use POST request/response calls.

The change is to stop depending on long-lived GET/SSE streaming and server-side
MCP session state until there is a proven use case and a correct ASGI
implementation.

Client implications:

- PanDA bot, testbed bot, and local scripts that use POST-only JSON-RPC should
  continue to work.
- Because swf-monitor MCP is only used locally here, there is no known external
  user depending on GET/SSE streaming.
- A generic MCP client that insists on stateful streamable HTTP sessions or
  working GET/SSE streams could fail, but that behavior is already unreliable
  today.
- If streaming becomes valuable later, reimplement it deliberately with an
  ASGI app that owns a lifespan-managed `StreamableHTTPSessionManager`, rather
  than the current Django APIView bridge.
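
To make "lifespan-managed" concrete, here is a minimal sketch using only the plain ASGI lifespan protocol. `StandInSessionManager` is a hypothetical placeholder, not the MCP SDK class, and its `start`/`stop` methods are invented for illustration. The point is only the ownership pattern: the long-lived object is started once at server startup and reused by every request.

```python
import asyncio


class StandInSessionManager:
    """Hypothetical stand-in for a long-lived manager (not the MCP SDK class)."""

    def __init__(self):
        self.starts = 0
        self.running = False

    async def start(self):
        self.starts += 1
        self.running = True

    async def stop(self):
        self.running = False


manager = StandInSessionManager()


async def app(scope, receive, send):
    """Minimal ASGI app: the manager starts once at startup, not per request."""
    if scope["type"] == "lifespan":
        while True:
            message = await receive()
            if message["type"] == "lifespan.startup":
                await manager.start()
                await send({"type": "lifespan.startup.complete"})
            elif message["type"] == "lifespan.shutdown":
                await manager.stop()
                await send({"type": "lifespan.shutdown.complete"})
                return
    elif scope["type"] == "http":
        # Every request reuses the already-running manager.
        body = b"running" if manager.running else b"stopped"
        await send({"type": "http.response.start", "status": 200,
                    "headers": [(b"content-type", b"text/plain")]})
        await send({"type": "http.response.body", "body": body})


async def drive():
    """Drive the app by hand: startup, two requests, shutdown."""
    inbox = asyncio.Queue()
    sent = []

    async def send(message):
        sent.append(message)

    async def no_body():
        return {"type": "http.request", "body": b"", "more_body": False}

    lifespan = asyncio.create_task(app({"type": "lifespan"}, inbox.get, send))
    await inbox.put({"type": "lifespan.startup"})
    while not sent:  # wait for lifespan.startup.complete
        await asyncio.sleep(0)
    for _ in range(2):
        await app({"type": "http"}, no_body, send)
    await inbox.put({"type": "lifespan.shutdown"})
    await lifespan
    return sent


sent = asyncio.run(drive())
print(manager.starts, [m["type"] for m in sent])
```

In a real deployment the ASGI server (uvicorn here) sends the lifespan messages itself; `drive` only simulates that so the sketch runs standalone.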

## Repair Plan

1. Make MCP stateless request/response first.

   Set `DJANGO_MCP_GLOBAL_SERVER_CONFIG["stateless"] = True`, stop issuing
   Django-backed MCP session IDs, and reduce uvicorn from 20 workers to a small
   count such as 2 or 4. The current tools are database/API primitives and do
   not require server-side MCP session state.

   Authentication implication: low risk. Current MCP authentication is request
   middleware, not MCP session state. Bearer token validation still happens per
   request. Local unauthenticated POSTs continue to pass unless that policy is
   changed explicitly.

2. Stop advertising streaming as an operational dependency.

   Document the local MCP deployment as POST request/response MCP. Do not
   promise GET/SSE support until it is deliberately reimplemented and tested.

3. Prefer local REST-style HTTP for bots over Python service coupling.

   Short term, point local bot MCP traffic at
   `http://127.0.0.1:8001/swf-monitor/mcp/` to bypass Apache/HTTPS for
   same-host calls.

   Better follow-up: add local REST endpoints for bot memory and common
   SWF/PanDA queries so bots can remain mobile without importing Django service
   functions directly. Keep MCP as an AI integration surface, not mandatory
   loopback plumbing for every local bot operation.

4. Add guardrails and observability.

   Add uvicorn concurrency limits, shorter graceful shutdown, a cheap
   non-MCP health endpoint, request start/end timing logs, and a watchdog that
   restarts `swf-monitor-mcp-asgi.service` if `initialize` or `tools/list`
   exceeds a small threshold.

5. Fix async logging.

   `DbLogHandler` currently writes to `AppLog` through synchronous Django ORM
   calls even when invoked from async contexts, producing repeated
   `You cannot call this from an async context` errors. Replace it with an
   async-safe queue/background-writer design, or disable DB logging in async
   uvicorn/bot processes until the queue writer exists. Logging failures must
   be visible; silent failure is not acceptable.

6. Fix bot exception paths.

   Initialize PanDA bot metadata variables before any MCP call can fail. On MCP
   failure, return a clear Mattermost-visible error and log the exception. Do
   the same review for testbed bot paths. A failed MCP request must not produce
   hidden task exceptions such as `UnboundLocalError`.
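
The exception-path fix in step 6 can be sketched as follows. This is an illustration under stated assumptions, not the bot's actual code: `handle_query`, `call_mcp`, the metadata names `tool_name` and `elapsed_s`, and the reply wording are all hypothetical.

```python
import logging

logger = logging.getLogger("bot")


def handle_query(question, call_mcp):
    """Answer a bot query; `call_mcp` is a hypothetical callable that may raise."""
    # Initialize everything the reply path reads *before* the MCP call,
    # so a failure cannot leave these names unbound.
    tool_name = "unknown"
    elapsed_s = None
    try:
        result = call_mcp(question)
        tool_name = result["tool"]
        elapsed_s = result["elapsed_s"]
        reply = result["text"]
    except Exception:
        # Log with traceback and return a user-visible message instead of
        # letting the failure disappear inside a background task.
        logger.exception("MCP request failed for question %r", question)
        reply = "Sorry, the MCP backend did not respond. The error has been logged."
    return {"reply": reply, "tool": tool_name, "elapsed_s": elapsed_s}


def failing_mcp(question):
    """Simulate the failure mode: the MCP call times out."""
    raise TimeoutError("initialize timed out")


print(handle_query("status?", failing_mcp))
```

Without the two initializations at the top, reading `tool_name` after a failed call would raise `UnboundLocalError` and mask the original timeout.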

## Verification

Before deployment:

- Direct local MCP POST tests: `initialize`, `tools/list`, representative
  `swf_*`, `panda_*`, and `pcs_*` calls.
- Apache-proxied MCP POST tests.
- Bot smoke tests against the local endpoint.
- Confirm GET/SSE behavior is either documented as intentionally unsupported or
  deferred to a later, complete implementation.
- Confirm `systemctl restart swf-monitor-mcp-asgi.service` exits cleanly
  without SIGKILL.
- Confirm async logging no longer emits Django async-context errors.
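
One possible shape for the async-safe logging design uses the standard library's `QueueHandler` and `QueueListener`. In this minimal sketch, `SlowSinkHandler` is a hypothetical stand-in for the blocking database write, and it is not the project's `DbLogHandler`. The blocking work runs in the listener's background thread, so the event loop only ever performs a queue put.

```python
import asyncio
import logging
import logging.handlers
import queue


class SlowSinkHandler(logging.Handler):
    """Hypothetical stand-in for a blocking sink such as a synchronous DB write."""

    def __init__(self):
        super().__init__()
        self.written = []

    def emit(self, record):
        try:
            # A real sink would perform the blocking write here.
            self.written.append(record.getMessage())
        except Exception:
            self.handleError(record)  # report failures instead of swallowing them


log_queue = queue.Queue()
sink = SlowSinkHandler()
listener = logging.handlers.QueueListener(log_queue, sink)  # background thread
listener.start()

logger = logging.getLogger("async_app")
logger.setLevel(logging.INFO)
logger.propagate = False
# The only handler on the logger enqueues records; it does no I/O itself.
logger.addHandler(logging.handlers.QueueHandler(log_queue))


async def handle_request():
    logger.info("request handled")  # only a queue put happens in the event loop


asyncio.run(handle_request())
listener.stop()  # drains queued records and joins the writer thread
print(sink.written)
```

`listener.stop()` should run at process shutdown so queued records are written before exit.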

## Execution Order

Phase 1 should be a stabilization patch: stateless MCP, reduced worker count,
local bot endpoint, async logging fix, and bot exception-path fixes.

Phase 2 should add broader observability: health endpoint, watchdog, timing
logs, and documentation cleanup.
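
A watchdog of the kind listed for Phase 2 might look like the sketch below. The function names, the 5-second default threshold, and the injectable `restart` callable are assumptions for illustration; only the unit name comes from this plan. The probe is expected to perform one MCP POST (for example `initialize`) with its own network timeout, so that a hung backend raises instead of blocking the watchdog.

```python
import subprocess
import time

SERVICE = "swf-monitor-mcp-asgi.service"  # unit name from the plan


def probe_is_healthy(probe, threshold_s):
    """Run `probe` (hypothetical callable doing one MCP POST) and time it.

    Returns False if the probe raises or takes longer than threshold_s.
    """
    start = time.monotonic()
    try:
        probe()
    except Exception:
        return False
    return (time.monotonic() - start) <= threshold_s


def watchdog_tick(probe, threshold_s=5.0, restart=None):
    """One watchdog iteration: restart the service if the probe is unhealthy."""
    if probe_is_healthy(probe, threshold_s):
        return "ok"
    if restart is None:
        # Default action; needs sufficient privileges to restart the unit.
        def restart():
            subprocess.run(["systemctl", "restart", SERVICE], check=False)
    restart()
    return "restarted"


def hung_probe():
    """Simulate a backend that never answers `initialize`."""
    raise TimeoutError("initialize did not answer")


calls = []
print(watchdog_tick(hung_probe, restart=lambda: calls.append("restart")))
```

Passing `restart` as a parameter keeps the decision logic testable without touching systemd. A production loop would call `watchdog_tick` on a timer and log each restart.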