# AEGIS Domain Integration Templates
Version: 1.0.0 | Updated: 2026-02-12 | Status: Active
Four worked examples showing how to map domain-specific metrics to AEGIS parameters. Each template includes a scenario, parameter mapping, complete JSON input, and expected decision walkthrough.
Prerequisite: Read Parameter Reference for detailed parameter semantics.
Interactive access: These templates are also available via the aegis_get_scoring_guide MCP tool — call with domain set to trading, cicd, moderation, agents, or generic.
## Template 1: Algorithmic Trading

### Scenario
A quantitative trading team wants to deploy a new mean-reversion strategy on the S&P 500 E-mini futures market. The strategy has been backtested for 2 years but has never been traded live. Current portfolio risk is moderate.
### Parameter Mapping

| AEGIS Parameter | Domain Metric | Derivation |
|---|---|---|
| risk_baseline | Current portfolio VaR / limit | $45K daily VaR / $500K limit = 0.09 |
| risk_proposed | Projected portfolio VaR / limit | $78K projected VaR / $500K limit = 0.156 |
| profit_baseline | Current Sharpe ratio | 1.2 (trailing 6-month) |
| profit_proposed | Backtest Sharpe ratio | 1.8 (2-year backtest) |
| novelty_score | 1 - cosine_sim(strategy, nearest) | New market regime = 0.65 |
| complexity_score | 1 - (instruments * markets / max) | 1 instrument, 1 market = 0.9 |
| quality_score | Backtest quality composite | [data_quality=0.85, code_review=0.9, stress_test=0.8] avg = 0.85 |
| estimated_impact | Position sizing | < 10% of portfolio = "medium" |
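The derivations in the table above can be reproduced in a few lines of Python. This is a sketch only; the helper names below are illustrative for this document and not part of any AEGIS API.

```python
# Illustrative derivation of Template 1 parameters from desk metrics.
# Helper names are assumptions of this sketch, not AEGIS functions.

def var_to_risk(daily_var: float, var_limit: float) -> float:
    """Normalize a VaR figure against the desk limit to a 0-1 risk score."""
    return daily_var / var_limit

def novelty_from_similarity(cosine_sim: float) -> float:
    """Novelty is the complement of similarity to the nearest known strategy."""
    return 1.0 - cosine_sim

def quality_composite(subscores) -> float:
    """Plain average of the quality subscores, matching the table above."""
    return sum(subscores) / len(subscores)

print(var_to_risk(45_000, 500_000))                   # risk_baseline = 0.09
print(var_to_risk(78_000, 500_000))                   # risk_proposed = 0.156
print(round(quality_composite([0.85, 0.9, 0.8]), 2))  # quality_score = 0.85
```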
### JSON Input

```json
{
  "proposal_summary": "Deploy mean-reversion strategy on ES futures (backtest Sharpe 1.8, 2yr)",
  "estimated_impact": "medium",
  "risk_baseline": 0.09,
  "risk_proposed": 0.156,
  "profit_baseline": 1.2,
  "profit_proposed": 1.8,
  "novelty_score": 0.65,
  "complexity_score": 0.9,
  "quality_score": 0.85,
  "quality_subscores": [0.85, 0.9, 0.8],
  "agent_id": "quant-desk-deployer",
  "reversible": true,
  "drift_baseline_data": [0.08, 0.09, 0.07, 0.11, 0.09, 0.08, 0.10, 0.09, 0.12, 0.08,
                          0.09, 0.07, 0.10, 0.09, 0.08, 0.11, 0.09, 0.10, 0.08, 0.09,
                          0.07, 0.10, 0.11, 0.09, 0.08, 0.09, 0.10, 0.08, 0.09, 0.11]
}
```
### Expected Decision Walkthrough
| Gate | Value | Threshold | Result |
|---|---|---|---|
| Risk | delta = (0.156-0.09)/0.09 = 0.73 | P(delta >= 2.0) < 0.95 | PASS — risk increased but not by 2x |
| Profit | delta = (1.8-1.2)/1.2 = 0.5 | Performance improved | PASS — positive improvement |
| Novelty | G(0.65) = 1/(1+exp(-10*(0.65-0.7))) ≈ 0.38 | 0.38 < 0.8 | FAIL — novelty below threshold |
| Complexity | 0.9 >= 0.5 | Floor check | PASS — well above floor |
| Quality | 0.85 >= 0.7, no zeros | Min score + subscore check | PASS |
| Drift | KL against baseline | KL < 0.3 expected | PASS — within normal range |
Expected status: PAUSE (novelty gate fails — proposal lacks sufficient novelty)
The novelty gate fails because G(0.65) ≈ 0.38, which is below the 0.8 threshold. This means the proposal is not sufficiently novel. Since estimated_impact is "medium", the decision pauses rather than escalates. The next_steps will recommend reviewing the novelty assessment.
Mitigation: If the strategy is genuinely novel (e.g., new market regime), increase novelty_score to 0.85+ to reflect the true novelty level. If the strategy is routine, the PAUSE is appropriate governance.
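The logistic gate used in the walkthrough can be sketched directly from the formula in the table, G(n) = 1/(1+exp(-10*(n-0.7))), with the 0.8 pass threshold; the steepness (10) and midpoint (0.7) are taken from the worked example above.

```python
import math

# Novelty gate sketch from the walkthrough formula:
# G(n) = 1 / (1 + exp(-10 * (n - 0.7))), pass when G(n) >= 0.8.
NOVELTY_THRESHOLD = 0.8

def novelty_gate(novelty_score: float):
    g = 1.0 / (1.0 + math.exp(-10.0 * (novelty_score - 0.7)))
    return g, g >= NOVELTY_THRESHOLD

g, passed = novelty_gate(0.65)
print(round(g, 2), passed)   # 0.38 False: the Template 1 PAUSE
g, passed = novelty_gate(0.85)
print(round(g, 2), passed)   # 0.82 True: the mitigation target
```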
## Template 2: CI/CD Pipeline Deployment

### Scenario
A platform team is deploying a database migration that changes the schema of the users table (5M rows) across 3 microservices. The deployment requires a 10-minute maintenance window. Recent deployment error rate has been stable at 2%.
### Parameter Mapping

| AEGIS Parameter | Domain Metric | Derivation |
|---|---|---|
| risk_baseline | Current error rate | 0.02 (2% error rate) |
| risk_proposed | Estimated post-deploy error rate | 0.08 (8% during migration window) |
| profit_baseline | Deploy throughput (deploys/day) | 12 deploys/day normalized: 12/50 = 0.24 |
| profit_proposed | Expected post-migration throughput | 14 deploys/day normalized: 14/50 = 0.28 |
| novelty_score | Change type classification | Schema migration = 0.6 |
| complexity_score | 1 - (services * tables / max) | 1 - (3 * 1 / 20) = 0.85 |
| quality_score | CI pipeline metrics | [test_pass=0.98, lint=1.0, review=0.9] avg = 0.96 |
| estimated_impact | Services affected | 3 services = "high" |
### JSON Input

```json
{
  "proposal_summary": "Schema migration: users table (5M rows), 3 services, 10-min maintenance window",
  "estimated_impact": "high",
  "risk_baseline": 0.02,
  "risk_proposed": 0.08,
  "profit_baseline": 0.24,
  "profit_proposed": 0.28,
  "novelty_score": 0.6,
  "complexity_score": 0.85,
  "quality_score": 0.96,
  "quality_subscores": [0.98, 1.0, 0.9],
  "agent_id": "deploy-bot-prod",
  "reversible": false,
  "requires_human_approval": true,
  "time_sensitive": true
}
```
### Expected Decision Walkthrough
| Gate | Value | Threshold | Result |
|---|---|---|---|
| Risk | delta = (0.08-0.02)/0.02 = 3.0 | P(delta >= 2.0) > 0.95 | FAIL — relative increase 3.0 exceeds trigger_factor 2.0 |
| Profit | delta = (0.28-0.24)/0.24 = 0.17 | Performance improved | PASS |
| Novelty | G(0.6) ≈ 0.27 | 0.27 < 0.8 | FAIL — insufficient novelty |
| Complexity | 0.85 >= 0.5 | Floor check | PASS |
| Quality | 0.96 >= 0.7, no zeros | Min score + subscore check | PASS |
Expected status: ESCALATE (high impact + multiple gate failures)
Both the risk gate and novelty gate fail. The risk gate fails because the error rate quadrupled (relative delta 3.0 > trigger_factor 2.0). The novelty gate fails because G(0.6) ≈ 0.27 < 0.8. Because estimated_impact is "high", the decision escalates rather than just pausing. The next_steps will include "Obtain human approval" (from requires_human_approval=true) and rollback planning guidance (from reversible=false).
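The risk gate's relative-delta arithmetic can be sketched as below. Note the walkthrough tables express the threshold as a probability, P(delta >= trigger_factor); simplifying it to a point comparison is an assumption of this sketch.

```python
# Risk-gate sketch: relative increase of proposed risk over baseline,
# compared against trigger_factor = 2.0 from the walkthrough tables.
# The probabilistic form P(delta >= 2.0) > 0.95 is simplified here to
# a direct comparison for illustration.
TRIGGER_FACTOR = 2.0

def risk_delta(baseline: float, proposed: float) -> float:
    return (proposed - baseline) / baseline

def risk_gate_passes(baseline: float, proposed: float) -> bool:
    return risk_delta(baseline, proposed) < TRIGGER_FACTOR

print(round(risk_delta(0.02, 0.08), 2))   # 3.0 -> FAIL (Template 2)
print(risk_gate_passes(0.09, 0.156))      # True: delta ~0.73 (Template 1)
```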
## Template 3: Content Moderation Policy Update

### Scenario
A trust & safety team is updating the content moderation policy to add a new category for AI-generated misinformation. This is a new policy area with no direct precedent. The team has high confidence in the rule definitions but limited data on false positive rates.
### Parameter Mapping

| AEGIS Parameter | Domain Metric | Derivation |
|---|---|---|
| risk_baseline | Current false positive rate | 0.03 (3% FPR) |
| risk_proposed | Estimated FPR with new rules | 0.05 (5% estimated, uncertain) |
| profit_baseline | Precision | 0.95 |
| profit_proposed | Expected precision | 0.92 (slightly lower due to new category) |
| novelty_score | Policy precedent | No precedent = 0.85 |
| complexity_score | 1 - (rules / total_rules) | 1 - (15 new / 200 total) = 0.925 |
| quality_score | Review composite | [rule_clarity=0.9, legal_review=0.8, annotator_agreement=0.75] avg = 0.82 |
| estimated_impact | User base affected | All users = "critical" |
### JSON Input

```json
{
  "proposal_summary": "Add AI-generated misinformation category to content moderation policy (15 new rules)",
  "estimated_impact": "critical",
  "risk_baseline": 0.03,
  "risk_proposed": 0.05,
  "profit_baseline": 0.95,
  "profit_proposed": 0.92,
  "novelty_score": 0.85,
  "complexity_score": 0.925,
  "quality_score": 0.82,
  "quality_subscores": [0.9, 0.8, 0.75],
  "agent_id": "trust-safety-reviewer",
  "reversible": true,
  "requires_human_approval": true
}
```
### Expected Decision Walkthrough
| Gate | Value | Threshold | Result |
|---|---|---|---|
| Risk | delta = (0.05-0.03)/0.03 = 0.67 | P(delta >= 2.0) < 0.95 | PASS — risk increase below trigger |
| Profit | delta = (0.92-0.95)/0.95 = -0.03 | Small decrease | PASS — decrease is minimal |
| Novelty | G(0.85) ≈ 0.82 | 0.82 >= 0.8 | PASS — high novelty meets threshold |
| Complexity | 0.925 >= 0.5 | Floor check | PASS |
| Quality | 0.82 >= 0.7, no zeros | Min score + subscore check | PASS |
Expected status: ESCALATE (critical impact — all gates pass but critical always escalates)
All gates pass, including the novelty gate — G(0.85) ≈ 0.82 barely clears the 0.8 threshold, meaning the proposal demonstrates sufficient novelty. However, with estimated_impact=critical, the decision automatically escalates regardless of gate outcomes. The rationale will note that all gates passed but human review is required due to critical blast radius.
Mitigation: Deploy to 5% of traffic first (reduces estimated_impact to "low"), then gradually expand. With all gates passing, a low-impact deployment would receive PROCEED.
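Taken together, the three walkthroughs so far imply an impact-based routing rule. The sketch below is an interpretation of those outcomes, not the canonical AEGIS decision function; treat the exact precedence as an assumption.

```python
# Hypothetical routing implied by Templates 1-3:
# critical always escalates; otherwise all-gates-pass proceeds;
# gate failures escalate at high impact and pause at lower impact.

def route(estimated_impact: str, gates_failed: int) -> str:
    if estimated_impact == "critical":
        return "ESCALATE"   # critical blast radius: human review regardless
    if gates_failed == 0:
        return "PROCEED"
    if estimated_impact == "high":
        return "ESCALATE"
    return "PAUSE"

print(route("medium", 1))    # PAUSE    (Template 1)
print(route("high", 2))      # ESCALATE (Template 2)
print(route("critical", 0))  # ESCALATE (Template 3)
```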
## Template 4: Autonomous Agent Self-Governance

### Scenario
An autonomous coding agent (e.g., Claude Code, Codex) is evaluating whether to proceed with a large refactoring task that touches 15 files across 3 modules. The agent has been tracking its own success/failure rates for the past month and wants to use shadow mode for calibration.
### Parameter Mapping

| AEGIS Parameter | Domain Metric | Derivation |
|---|---|---|
| risk_baseline | Recent failure rate | 0.05 (5% task failure rate) |
| risk_proposed | Estimated failure rate for this task | 0.12 (complex refactor) |
| profit_baseline | Code quality before | 0.7 (maintainability index) |
| profit_proposed | Expected quality after | 0.85 (cleaner architecture) |
| novelty_score | Action confidence | 1 - 0.75 confidence = 0.25 (familiar pattern) |
| complexity_score | 1 - (files * modules / max) | 1 - (15 * 3 / 100) = 0.55 |
| quality_score | Self-assessment | [plan_coherence=0.8, test_coverage=0.7, safety=0.9] avg = 0.8 |
| estimated_impact | Files changed | 15 files, 3 modules = "high" |
### JSON Input (Shadow Mode Calibration)

```json
{
  "proposal_summary": "Refactor authentication module: 15 files across 3 modules for improved maintainability",
  "estimated_impact": "high",
  "risk_baseline": 0.05,
  "risk_proposed": 0.12,
  "profit_baseline": 0.7,
  "profit_proposed": 0.85,
  "novelty_score": 0.25,
  "complexity_score": 0.55,
  "quality_score": 0.8,
  "quality_subscores": [0.8, 0.7, 0.9],
  "agent_id": "claude-code-refactor",
  "shadow_mode": true,
  "reversible": true,
  "drift_baseline_data": [0.04, 0.06, 0.05, 0.03, 0.07, 0.05, 0.04, 0.06, 0.05, 0.04,
                          0.05, 0.03, 0.06, 0.05, 0.04, 0.07, 0.05, 0.06, 0.04, 0.05,
                          0.03, 0.05, 0.06, 0.04, 0.05, 0.07, 0.05, 0.04, 0.06, 0.05,
                          0.04, 0.05, 0.03]
}
```
### Expected Decision Walkthrough
| Gate | Value | Threshold | Result |
|---|---|---|---|
| Risk | delta = (0.12-0.05)/0.05 = 1.4 | P(delta >= 2.0) < 0.95 | PASS — increased but below 2x |
| Profit | delta = (0.85-0.7)/0.7 = 0.21 | Performance improved | PASS |
| Novelty | G(0.25) ≈ 0.01 | 0.01 < 0.8 | FAIL — insufficient novelty (familiar pattern scores low) |
| Complexity | 0.55 >= 0.5 | Floor check | PASS — barely above floor |
| Quality | 0.8 >= 0.7, no zeros | Min score + subscore check | PASS |
| Drift | KL against baseline | KL < 0.3 expected | PASS |
Expected status: ESCALATE (high impact + novelty gate failure — but in shadow mode, advisory only)
The novelty gate fails because G(0.25) ≈ 0.01 is well below the 0.8 threshold. A low novelty score means the proposal lacks sufficient novelty — the agent's high confidence (0.75) translates to low novelty (0.25), which the gate interprets as insufficient. With estimated_impact=high and a failing gate, the decision escalates. Because shadow_mode=true, the decision is advisory. The response includes a shadow_result object with:

- shadow_only: true
- observation_values: current metric values for future drift comparison
- baseline_hash: SHA-256 of the baseline data (tamper detection)
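The baseline_hash check can be reproduced along these lines. The exact serialization AEGIS hashes is not documented here, so hashing a JSON serialization of the list is an assumption of this sketch.

```python
import hashlib
import json

# Sketch: tamper-evident hash over the drift baseline. Assumes SHA-256
# over a JSON serialization of the list; the canonical form AEGIS
# actually hashes may differ.

def baseline_hash(drift_baseline_data: list) -> str:
    payload = json.dumps(drift_baseline_data).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

baseline = [0.04, 0.06, 0.05, 0.03, 0.07]
digest = baseline_hash(baseline)
print(len(digest))                                 # 64 hex characters
print(baseline_hash(baseline) == digest)           # True: stable if untouched
print(baseline_hash(baseline + [0.05]) == digest)  # False: any edit is visible
```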
Calibration insight: The ESCALATE result reveals that mapping novelty_score = 1 - confidence may be too aggressive for routine refactoring tasks. If the agent is confident, that should not trigger escalation. Consider mapping novelty differently — e.g., only use high novelty scores for genuinely novel actions (new language, unfamiliar codebase, first-time architecture change).
Calibration workflow:
1. Run in shadow mode for 30+ days, collecting decisions
2. Review shadow results — are the gates producing sensible outcomes?
3. Adjust estimated_impact or metric derivations if gates are too strict/lenient (e.g., this example suggests refining the novelty mapping)
4. Switch shadow_mode to false to enforce decisions
Note on complexity_score: At 0.55, this is barely above the 0.5 floor. If the refactor scope grows to 20+ files, recalculate — dropping below 0.5 would cause an unoverridable HALT.
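The recalculation flagged in the note above can be sketched from the Template 4 mapping, 1 - (files * modules / 100); the denominator of 100 is taken from that table.

```python
# Complexity-score sketch using the Template 4 mapping.
# Dropping below the 0.5 floor triggers an unoverridable HALT.
COMPLEXITY_FLOOR = 0.5

def complexity_score(files: int, modules: int, max_units: int = 100) -> float:
    return 1.0 - (files * modules) / max_units

print(round(complexity_score(15, 3), 2))            # 0.55 -> above the floor
print(complexity_score(20, 3) >= COMPLEXITY_FLOOR)  # False -> would HALT
```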
## Common Integration Patterns

### Pattern: Gradual Rollout
Start with shadow mode, then enforce on low-risk changes first:
Phase 1 (Week 1-4): shadow_mode=true, all proposals
Phase 2 (Week 5-8): shadow_mode=false, estimated_impact=low only
Phase 3 (Week 9-12): shadow_mode=false, low + medium
Phase 4 (Week 13+): shadow_mode=false, all proposals
### Pattern: Pre-flight Check

Call aegis_check_thresholds before submitting a proposal to understand which gate values will be evaluated.

### Pattern: Quick Risk Guard
For simple actions where only risk matters, use the simplified API:
```json
{"method": "tools/call", "params": {"name": "aegis_quick_risk_check", "arguments": {"action_description": "Delete staging database", "risk_score": 0.8}}}
```
This returns safe: false (0.8 >= 0.5 threshold) without full gate evaluation. Use aegis_evaluate_proposal for actual governance decisions.
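The guard's semantics can be sketched as follows. The function below is a local stand-in for the aegis_quick_risk_check tool, not the tool itself; the 0.5 threshold comes from this section.

```python
# Stand-in sketch for the quick risk guard described above:
# safe iff risk_score is below the 0.5 threshold. This mirrors the
# documented behavior; it is not the MCP tool.
QUICK_RISK_THRESHOLD = 0.5

def quick_risk_check(action_description: str, risk_score: float) -> dict:
    return {
        "action": action_description,
        "safe": risk_score < QUICK_RISK_THRESHOLD,
    }

result = quick_risk_check("Delete staging database", risk_score=0.8)
print(result["safe"])   # False: 0.8 >= 0.5
```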
## References
- Parameter Reference — Complete parameter documentation
- Production Guide — Deployment and observability
- Interface Contract — Frozen parameter values