Metadata: clean
Route: /knowledge/portal/shared/capacity-and-cost-guardrails
Source: knowledge/portal/shared/capacity-and-cost-guardrails.md
Covered files: 5
Last generated: 2026-04-13T15:08:52.627606+00:00

Capacity and cost guardrails

State: deterministic sync completed

Purpose

Define the single-source contract for ScrumAI runtime capacity and spend protections exposed by Portal control-plane routes.

Canonical architecture (no hidden second architecture)

  • Runtime truth is apps/portal/app.py backed by Portal/TzenBoard runtime signals.
  • Contract exposure is /api/scrumai/*.
  • This contract does not introduce a second scheduler, worker fleet, or hidden control-plane.

Contract surfaces

  • GET /api/scrumai/guardrails
      • Contract: scrumai_capacity_cost_guardrails_v1
      • Purpose: explicit limits, behavior contract, environment expectations, and runtime mapping.
  • GET /api/scrumai/queue/state
      • Contract: scrumai_queue_state_v1
      • Purpose: dispatch readiness plus summary.monthly_cost_guard.
  • GET /api/scrumai/agents/<agent_key>/runtime
      • Contract: scrumai_role_runtime_v1
      • Purpose: per-role dispatch readiness plus monthly_cost_guard snapshot.
  • GET /api/scrumai/bootstrap
      • Contract: scrumai_control_plane_bootstrap_v1
      • Purpose: minimum control-plane route map, including the guardrails surface.
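
A minimal sketch of how a client might confirm that each surface returns the contract it advertises. The route and contract identifiers come from the list above; the helper names and the top-level `contract` field in the response payload are assumptions, not confirmed runtime behavior.

```python
# Route template -> contract identifier, as listed in "Contract surfaces".
CONTRACT_SURFACES = {
    "/api/scrumai/guardrails": "scrumai_capacity_cost_guardrails_v1",
    "/api/scrumai/queue/state": "scrumai_queue_state_v1",
    "/api/scrumai/agents/<agent_key>/runtime": "scrumai_role_runtime_v1",
    "/api/scrumai/bootstrap": "scrumai_control_plane_bootstrap_v1",
}


def route_for(template: str, agent_key: str = "") -> str:
    """Fill the <agent_key> placeholder of a route template (hypothetical helper)."""
    if "<agent_key>" in template and not agent_key:
        raise ValueError("agent_key is required for this route")
    return template.replace("<agent_key>", agent_key)


def contract_matches(template: str, payload: dict) -> bool:
    """Assumption: the payload names its contract in a top-level 'contract' field."""
    return payload.get("contract") == CONTRACT_SURFACES[template]
```

A caller would fetch `route_for(...)` with any HTTP client and pass the decoded JSON body to `contract_matches`.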

Hard limits (runtime-enforced contract)

Exposed via /api/scrumai/guardrails.hard_limits:

  • global_concurrency = 1
  • single_fallback_max_attempts = 2 (one primary + one fallback)
  • failure_max_attempts = 1 for timeout/crash terminal path
  • circuit_breaker_threshold from PORTAL_SCRUMAI_CIRCUIT_BREAKER_THRESHOLD (default 3)
  • live dispatch timeout bounds: 15..900 seconds
  • SCRUMAI_APPLY write scope allowlisted to docs/evidence/generated
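
The timeout bounds and the write-scope allowlist above can be illustrated with a short sketch. The function and constant names are hypothetical, and the path normalization is one possible way to enforce the allowlist; the actual checks in app.py may differ.

```python
import posixpath

# Bounds and allowlisted root taken from the hard limits above.
LIVE_DISPATCH_TIMEOUT_MIN_S = 15
LIVE_DISPATCH_TIMEOUT_MAX_S = 900
APPLY_WRITE_ROOT = "docs/evidence/generated"


def clamp_dispatch_timeout(requested_seconds: float) -> int:
    """Clamp a requested live dispatch timeout into the 15..900 second range."""
    seconds = int(requested_seconds)
    return max(LIVE_DISPATCH_TIMEOUT_MIN_S, min(LIVE_DISPATCH_TIMEOUT_MAX_S, seconds))


def is_apply_write_allowed(relative_path: str) -> bool:
    """Accept only paths that stay under the allowlisted root after normalization.

    Assumption: paths are repo-relative and POSIX-style.
    """
    normalized = posixpath.normpath(relative_path)
    return normalized.startswith(APPLY_WRITE_ROOT + "/")
```

Normalizing before the prefix check means a path such as `docs/evidence/generated/../../secrets.md` is rejected rather than slipping through on a string match.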

Alert / block / fallback behavior contract

The runtime must name these behaviors explicitly rather than leave them implied in prose:

| Behavior | Current trigger class | Current observable surfaces | Expected outcome |
|---|---|---|---|
| alert | blocked dispatch, failed terminal dispatch, breached monthly ceiling | /api/scrumai/hub latest dispatch issue; /api/scrumai/queue/state blocked reason code; /api/scrumai/guardrails monthly cost state | machine-readable reason code is visible to operators |
| block | lane health gate failure, telemetry write failure, allowlist rejection, cost breach, breaker trip | /api/scrumai/queue/state, /api/scrumai/agents/<agent_key>/runtime, /api/scrumai/agents/<agent_key>/runtime/start | dispatch denied or terminalized with explicit reason |
| fallback | primary dispatch failure on fallback-enabled lane | /api/scrumai/agents/<agent_key>/runtime/start and .../sync dispatch audit | max one fallback attempt, then terminal/blocked outcome |
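
The fallback row (at most one fallback attempt, then a terminal outcome) can be sketched as follows. This is an illustration under stated assumptions: the function name, the callable-based lanes, and the `"completed"` / `"terminal"` outcome labels are hypothetical and not taken from app.py.

```python
from typing import Callable, List, Tuple

# One primary attempt plus at most one fallback attempt.
SINGLE_FALLBACK_MAX_ATTEMPTS = 2


def dispatch_with_single_fallback(
    primary: Callable[[], bool],
    fallback: Callable[[], bool],
    fallback_enabled: bool,
) -> Tuple[str, List[str]]:
    """Run the primary lane, then at most one fallback, and report what was tried."""
    lanes = [("primary", primary)]
    if fallback_enabled:
        lanes.append(("fallback", fallback))

    attempts: List[str] = []
    for name, run in lanes[:SINGLE_FALLBACK_MAX_ATTEMPTS]:
        attempts.append(name)
        if run():  # each callable returns True on success
            return "completed", attempts
    # No further retries: the dispatch ends in a terminal outcome.
    return "terminal", attempts
```

Returning the list of attempted lanes alongside the outcome mirrors the idea that the dispatch audit should show which path was taken.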

Environment expectations

| Profile | Intended use | Guardrail expectation |
|---|---|---|
| prod | live operator lane | enforce strict single concurrency and bounded retry/fallback; expect monthly ceiling to be configured; treat force_dispatch as emergency-only override with audit |
| dev_test | migration/test/proof lane | same hard safety caps (single concurrency + bounded retry/fallback), but ceiling may be disabled (0.0) and controlled force_dispatch can be used to exercise guardrails |

Environment resolution is exposed via /api/scrumai/guardrails.environment_expectations. Unknown APP_ENV values currently default to the dev_test profile and are marked explicitly in the payload.
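
A minimal sketch of the resolution rule described above, assuming APP_ENV values match the profile names exactly. The function name and the `defaulted` / `raw_app_env` fields are hypothetical; the source says only that the fallback is marked explicitly in the payload.

```python
from typing import Optional

KNOWN_PROFILES = {"prod", "dev_test"}


def resolve_profile(app_env: Optional[str]) -> dict:
    """Map APP_ENV to a guardrail profile; unknown values fall back to dev_test."""
    value = (app_env or "").strip()
    if value in KNOWN_PROFILES:
        return {"profile": value, "defaulted": False}
    # Unknown or missing value: default to dev_test and say so in the payload.
    return {"profile": "dev_test", "defaulted": True, "raw_app_env": app_env}
```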

Runtime knob mapping (contract -> live runtime)

Exposed via /api/scrumai/guardrails.runtime_knob_mapping:

| Contract knob | Runtime source today | Operator-tunable now |
|---|---|---|
| max_parallel_jobs | fixed in app.py (global_concurrency=1) | no |
| single_fallback_max_attempts | SCRUMAI_SINGLE_FALLBACK_MAX_ATTEMPTS constant | no |
| failure_max_attempts | SCRUMAI_FAILURE_MAX_ATTEMPTS constant | no |
| circuit_breaker_threshold | PORTAL_SCRUMAI_CIRCUIT_BREAKER_THRESHOLD | yes |
| monthly_cost_window_days | PORTAL_SCRUMAI_MONTHLY_COST_WINDOW_DAYS | yes |
| monthly_cost_ceiling_usd | PORTAL_SCRUMAI_MONTHLY_COST_CEILING_USD | yes |
| monthly_cost_review_day_utc | PORTAL_SCRUMAI_MONTHLY_COST_REVIEW_DAY_UTC | yes |
| live_dispatch_timeout_seconds | live dispatch timeout clamp in app.py (15..900) | yes (request/env path) |

Known gaps (explicitly tracked)

Exposed via /api/scrumai/guardrails.known_gaps:

  • alert_delivery_sink_missing: alert semantics are visible in API contracts, but no pager/webhook sink is wired yet.
  • prod_force_dispatch_policy_not_hard_enforced: prod policy is documented as emergency-only, but runtime currently trusts caller force_dispatch + audit trail.
  • global_concurrency_not_operator_tunable: max parallel jobs remains fixed at 1 in code.

Monthly cost guard and review contract

Runtime ceiling controls

  • PORTAL_SCRUMAI_MONTHLY_COST_WINDOW_DAYS (default 30)
  • PORTAL_SCRUMAI_MONTHLY_COST_CEILING_USD (default 0.0 = disabled)
  • PORTAL_SCRUMAI_MONTHLY_COST_REVIEW_DAY_UTC (1..28, default 1)
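
A sketch of reading these three knobs with the documented defaults. The variable names and defaults come from the list above; the function name, the fallback to defaults on unparsable values, and the clamping of the review day into 1..28 are assumptions about behavior the source does not spell out.

```python
import os
from typing import Mapping


def read_cost_knobs(env: Mapping[str, str] = os.environ) -> dict:
    """Read the monthly cost knobs, using the documented defaults when unset."""

    def _int(name: str, default: int) -> int:
        try:
            return int(env.get(name, default))
        except (TypeError, ValueError):
            return default  # assumption: unparsable values fall back to the default

    def _float(name: str, default: float) -> float:
        try:
            return float(env.get(name, default))
        except (TypeError, ValueError):
            return default

    review_day = _int("PORTAL_SCRUMAI_MONTHLY_COST_REVIEW_DAY_UTC", 1)
    return {
        "window_days": _int("PORTAL_SCRUMAI_MONTHLY_COST_WINDOW_DAYS", 30),
        # 0.0 means the ceiling is disabled.
        "ceiling_usd": _float("PORTAL_SCRUMAI_MONTHLY_COST_CEILING_USD", 0.0),
        # Assumption: out-of-range days are clamped into 1..28.
        "review_day_utc": max(1, min(28, review_day)),
    }
```

Passing a plain dict instead of `os.environ` makes the parsing easy to exercise in tests without touching the process environment.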

Block behavior

  • The cost snapshot is computed from inference_events over the configured rolling window (UTC).
  • When ceiling_usd > 0 and spend_usd >= ceiling_usd, dispatch is marked blocked on queue/runtime surfaces.
  • The block reason includes MONTHLY_COST_CEILING_REACHED.
  • Forced dispatch remains an explicit operator override via the runtime request payload (telemetry.force_dispatch).
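
The block rule above can be sketched as a small pure function. The condition and the reason code come from the bullets; the function names, the `(timestamp, cost_usd)` event shape, the returned field names, and the choice to keep the reason code visible when force_dispatch lifts the block are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple


def window_spend_usd(
    events: Iterable[Tuple[datetime, float]], now: datetime, window_days: int
) -> float:
    """Sum event costs inside the rolling window ending at `now` (UTC)."""
    start = now - timedelta(days=window_days)
    return sum(cost for ts, cost in events if start <= ts <= now)


def monthly_cost_guard(
    spend_usd: float, ceiling_usd: float, force_dispatch: bool = False
) -> dict:
    """Evaluate the ceiling; a ceiling of 0.0 disables the guard."""
    breached = ceiling_usd > 0 and spend_usd >= ceiling_usd
    return {
        "ceiling_enabled": ceiling_usd > 0,
        "breached": breached,
        # Assumption: the operator override lifts the block but not the reason code.
        "blocked": breached and not force_dispatch,
        "reason_codes": ["MONTHLY_COST_CEILING_REACHED"] if breached else [],
    }
```

Keeping the spend calculation separate from the decision makes it straightforward to check the threshold logic with fixed numbers.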

Monthly review process

1. Confirm spend and remote spend for current window.
2. Decide if ceiling/env knobs need adjustment for next month.
3. Record decision, owner, and timestamp in the evidence artifact.

  • Cadence: monthly on configured UTC review day.
  • Owners: Product Owner, Scrum Master, Dan.
  • Required inputs:
      • GET /api/scrumai/guardrails (monthly_cost_guard, review due fields)
      • inference_events spend totals for local/remote split
  • Required evidence artifact pattern:
      • docs/evidence/task-245-monthly-cost-review-YYYY-MM.md
  • Required checks: