AEGIS Domain Integration Templates

Version: 1.0.0 | Updated: 2026-02-12 | Status: Active

Four worked examples showing how to map domain-specific metrics to AEGIS parameters. Each template includes a scenario, parameter mapping, complete JSON input, and expected decision walkthrough.

Prerequisite: Read Parameter Reference for detailed parameter semantics.

Interactive access: These templates are also available via the aegis_get_scoring_guide MCP tool — call with domain set to trading, cicd, moderation, agents, or generic.


Template 1: Algorithmic Trading

Scenario

A quantitative trading team wants to deploy a new mean-reversion strategy on the S&P 500 E-mini futures market. The strategy has been backtested for 2 years but has never been traded live. Current portfolio risk is moderate.

Parameter Mapping

| AEGIS Parameter | Domain Metric | Derivation |
| --- | --- | --- |
| risk_baseline | Current portfolio VaR / limit | $45K daily VaR / $500K limit = 0.09 |
| risk_proposed | Projected portfolio VaR / limit | $78K projected VaR / $500K limit = 0.156 |
| profit_baseline | Current Sharpe ratio | 1.2 (trailing 6-month) |
| profit_proposed | Backtest Sharpe ratio | 1.8 (2-year backtest) |
| novelty_score | 1 - cosine_sim(strategy, nearest) | New market regime = 0.65 |
| complexity_score | 1 - (instruments * markets / max) | 1 instrument, 1 market = 0.9 |
| quality_score | Backtest quality composite | [data_quality=0.85, code_review=0.9, stress_test=0.8] avg = 0.85 |
| estimated_impact | Position sizing | < 10% of portfolio = "medium" |

JSON Input

{
  "proposal_summary": "Deploy mean-reversion strategy on ES futures (backtest Sharpe 1.8, 2yr)",
  "estimated_impact": "medium",
  "risk_baseline": 0.09,
  "risk_proposed": 0.156,
  "profit_baseline": 1.2,
  "profit_proposed": 1.8,
  "novelty_score": 0.65,
  "complexity_score": 0.9,
  "quality_score": 0.85,
  "quality_subscores": [0.85, 0.9, 0.8],
  "agent_id": "quant-desk-deployer",
  "reversible": true,
  "drift_baseline_data": [0.08, 0.09, 0.07, 0.11, 0.09, 0.08, 0.10, 0.09, 0.12, 0.08,
                          0.09, 0.07, 0.10, 0.09, 0.08, 0.11, 0.09, 0.10, 0.08, 0.09,
                          0.07, 0.10, 0.11, 0.09, 0.08, 0.09, 0.10, 0.08, 0.09, 0.11]
}

Expected Decision Walkthrough

| Gate | Value | Threshold | Result |
| --- | --- | --- | --- |
| Risk | delta = (0.156-0.09)/0.09 = 0.73 | P(delta >= 2.0) < 0.95 | PASS — risk increased but not by 2x |
| Profit | delta = (1.8-1.2)/1.2 = 0.5 | Performance improved | PASS — positive improvement |
| Novelty | G(0.65) = 1/(1+exp(-10*(0.65-0.7))) ≈ 0.38 | 0.38 < 0.8 | FAIL — novelty below threshold |
| Complexity | 0.9 >= 0.5 | Floor check | PASS — well above floor |
| Quality | 0.85 >= 0.7, no zeros | Min score + subscore check | PASS |
| Drift | KL against baseline | KL < 0.3 expected | PASS — within normal range |

Expected status: PAUSE (novelty gate fails — proposal lacks sufficient novelty)

The novelty gate fails because G(0.65) ≈ 0.38, which is below the 0.8 threshold. This means the proposal is not sufficiently novel. Since estimated_impact is "medium", the decision pauses rather than escalates. The next_steps will recommend reviewing the novelty assessment.
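The walkthrough arithmetic can be reproduced in a few lines of Python. This is a sketch of the formulas shown above (relative deltas plus the logistic gate G), not of AEGIS internals; the steepness 10 and midpoint 0.7 are taken from the G(0.65) expression in the table:

```python
import math

def novelty_gate(score, midpoint=0.7, steepness=10.0):
    """Logistic novelty gate: G(x) = 1 / (1 + exp(-k * (x - x0)))."""
    return 1.0 / (1.0 + math.exp(-steepness * (score - midpoint)))

risk_delta = (0.156 - 0.09) / 0.09   # relative risk increase, ~0.73 (below the 2.0 trigger)
profit_delta = (1.8 - 1.2) / 1.2     # relative Sharpe improvement, 0.5
g = novelty_gate(0.65)               # ~0.38, below the 0.8 threshold, so the gate fails
```

Plugging in the other templates' values (G(0.6) ≈ 0.27, G(0.85) ≈ 0.82, G(0.25) ≈ 0.01) reproduces their walkthroughs as well.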

Mitigation: If the strategy is genuinely novel (e.g., new market regime), increase novelty_score to 0.85+ to reflect the true novelty level. If the strategy is routine, the PAUSE is appropriate governance.


Template 2: CI/CD Pipeline Deployment

Scenario

A platform team is deploying a database migration that changes the schema of the users table (5M rows) across 3 microservices. The deployment requires a 10-minute maintenance window. Recent deployment error rate has been stable at 2%.

Parameter Mapping

| AEGIS Parameter | Domain Metric | Derivation |
| --- | --- | --- |
| risk_baseline | Current error rate | 0.02 (2% error rate) |
| risk_proposed | Estimated post-deploy error rate | 0.08 (8% during migration window) |
| profit_baseline | Deploy throughput (deploys/day) | 12 deploys/day, normalized: 12/50 = 0.24 |
| profit_proposed | Expected post-migration throughput | 14 deploys/day, normalized: 14/50 = 0.28 |
| novelty_score | Change type classification | Schema migration = 0.6 |
| complexity_score | 1 - (services * tables / max) | 1 - (3 * 1 / 20) = 0.85 |
| quality_score | CI pipeline metrics | [test_pass=0.98, lint=1.0, review=0.9] avg = 0.96 |
| estimated_impact | Services affected | 3 services = "high" |

JSON Input

{
  "proposal_summary": "Schema migration: users table (5M rows), 3 services, 10-min maintenance window",
  "estimated_impact": "high",
  "risk_baseline": 0.02,
  "risk_proposed": 0.08,
  "profit_baseline": 0.24,
  "profit_proposed": 0.28,
  "novelty_score": 0.6,
  "complexity_score": 0.85,
  "quality_score": 0.96,
  "quality_subscores": [0.98, 1.0, 0.9],
  "agent_id": "deploy-bot-prod",
  "reversible": false,
  "requires_human_approval": true,
  "time_sensitive": true
}

Expected Decision Walkthrough

| Gate | Value | Threshold | Result |
| --- | --- | --- | --- |
| Risk | delta = (0.08-0.02)/0.02 = 3.0 | P(delta >= 2.0) > 0.95 | FAIL — risk more than doubled |
| Profit | delta = (0.28-0.24)/0.24 = 0.17 | Performance improved | PASS |
| Novelty | G(0.6) ≈ 0.27 | 0.27 < 0.8 | FAIL — insufficient novelty |
| Complexity | 0.85 >= 0.5 | Floor check | PASS |
| Quality | 0.96 >= 0.7, no zeros | Min score + subscore check | PASS |

Expected status: ESCALATE (high impact + multiple gate failures)

Both the risk gate and the novelty gate fail. The risk gate fails because the error rate quadruples during the migration window: the relative delta is (0.08-0.02)/0.02 = 3.0, exceeding the trigger_factor of 2.0. The novelty gate fails because G(0.6) ≈ 0.27 < 0.8. Because estimated_impact is "high", the decision escalates rather than merely pausing. The next_steps will include "Obtain human approval" (from requires_human_approval=true) and rollback planning guidance (from reversible=false).
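A point-estimate version of the risk gate's trigger check can be sketched as follows. The production gate is probabilistic, testing P(delta >= trigger_factor) against 0.95; this simplification compares the computed delta directly:

```python
def risk_gate_trips(baseline, proposed, trigger_factor=2.0):
    """Trip the gate when the relative risk increase reaches the trigger factor."""
    delta = (proposed - baseline) / baseline
    return delta >= trigger_factor

tripped_t2 = risk_gate_trips(0.02, 0.08)    # delta = 3.0 -> True, this template's risk gate fails
tripped_t1 = risk_gate_trips(0.09, 0.156)   # delta ~0.73 -> False, Template 1's risk gate passes
```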


Template 3: Content Moderation Policy Update

Scenario

A trust & safety team is updating the content moderation policy to add a new category for AI-generated misinformation. This is a new policy area with no direct precedent. The team has high confidence in the rule definitions but limited data on false positive rates.

Parameter Mapping

| AEGIS Parameter | Domain Metric | Derivation |
| --- | --- | --- |
| risk_baseline | Current false positive rate | 0.03 (3% FPR) |
| risk_proposed | Estimated FPR with new rules | 0.05 (5% estimated, uncertain) |
| profit_baseline | Precision | 0.95 |
| profit_proposed | Expected precision | 0.92 (slightly lower due to new category) |
| novelty_score | Policy precedent | No precedent = 0.85 |
| complexity_score | 1 - (rules / total_rules) | 1 - (15 new / 200 total) = 0.925 |
| quality_score | Review composite | [rule_clarity=0.9, legal_review=0.8, annotator_agreement=0.75] avg = 0.82 |
| estimated_impact | User base affected | All users = "critical" |

JSON Input

{
  "proposal_summary": "Add AI-generated misinformation category to content moderation policy (15 new rules)",
  "estimated_impact": "critical",
  "risk_baseline": 0.03,
  "risk_proposed": 0.05,
  "profit_baseline": 0.95,
  "profit_proposed": 0.92,
  "novelty_score": 0.85,
  "complexity_score": 0.925,
  "quality_score": 0.82,
  "quality_subscores": [0.9, 0.8, 0.75],
  "agent_id": "trust-safety-reviewer",
  "reversible": true,
  "requires_human_approval": true
}

Expected Decision Walkthrough

| Gate | Value | Threshold | Result |
| --- | --- | --- | --- |
| Risk | delta = (0.05-0.03)/0.03 = 0.67 | P(delta >= 2.0) < 0.95 | PASS — risk increase below trigger |
| Profit | delta = (0.92-0.95)/0.95 = -0.03 | Small decrease | PASS — decrease is minimal |
| Novelty | G(0.85) ≈ 0.82 | 0.82 >= 0.8 | PASS — high novelty meets threshold |
| Complexity | 0.925 >= 0.5 | Floor check | PASS |
| Quality | 0.82 >= 0.7, no zeros | Min score + subscore check | PASS |

Expected status: ESCALATE (critical impact — all gates pass but critical always escalates)

All gates pass, including the novelty gate — G(0.85) ≈ 0.82 barely clears the 0.8 threshold, meaning the proposal demonstrates sufficient novelty. However, with estimated_impact=critical, the decision automatically escalates regardless of gate outcomes. The rationale will note that all gates passed but human review is required due to critical blast radius.
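The quality gate's two conditions, as described in the walkthrough tables (composite at or above 0.7, and no zero subscores), can be sketched as:

```python
def quality_gate_passes(quality_score, subscores, min_score=0.7):
    """Pass when the composite meets the floor and no subscore is zero."""
    return quality_score >= min_score and all(s > 0 for s in subscores)

ok = quality_gate_passes(0.82, [0.9, 0.8, 0.75])      # this template's inputs -> True
vetoed = quality_gate_passes(0.82, [0.9, 0.8, 0.0])   # a single zero subscore vetoes the gate
```

Note that the subscore check is independent of the average: a strong composite cannot mask a zeroed dimension such as a skipped legal review.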

Mitigation: Deploy to 5% of traffic first (reduces estimated_impact to "low"), then gradually expand. With all gates passing, a low-impact deployment would receive PROCEED.


Template 4: Autonomous Agent Self-Governance

Scenario

An autonomous coding agent (e.g., Claude Code, Codex) is evaluating whether to proceed with a large refactoring task that touches 15 files across 3 modules. The agent has been tracking its own success/failure rates for the past month and wants to use shadow mode for calibration.

Parameter Mapping

| AEGIS Parameter | Domain Metric | Derivation |
| --- | --- | --- |
| risk_baseline | Recent failure rate | 0.05 (5% task failure rate) |
| risk_proposed | Estimated failure rate for this task | 0.12 (complex refactor) |
| profit_baseline | Code quality before | 0.7 (maintainability index) |
| profit_proposed | Expected quality after | 0.85 (cleaner architecture) |
| novelty_score | Action confidence | 1 - 0.75 confidence = 0.25 (familiar pattern) |
| complexity_score | 1 - (files * modules / max) | 1 - (15 * 3 / 100) = 0.55 |
| quality_score | Self-assessment | [plan_coherence=0.8, test_coverage=0.7, safety=0.9] avg = 0.8 |
| estimated_impact | Files changed | 15 files, 3 modules = "high" |

JSON Input (Shadow Mode Calibration)

{
  "proposal_summary": "Refactor authentication module: 15 files across 3 modules for improved maintainability",
  "estimated_impact": "high",
  "risk_baseline": 0.05,
  "risk_proposed": 0.12,
  "profit_baseline": 0.7,
  "profit_proposed": 0.85,
  "novelty_score": 0.25,
  "complexity_score": 0.55,
  "quality_score": 0.8,
  "quality_subscores": [0.8, 0.7, 0.9],
  "agent_id": "claude-code-refactor",
  "shadow_mode": true,
  "reversible": true,
  "drift_baseline_data": [0.04, 0.06, 0.05, 0.03, 0.07, 0.05, 0.04, 0.06, 0.05, 0.04,
                          0.05, 0.03, 0.06, 0.05, 0.04, 0.07, 0.05, 0.06, 0.04, 0.05,
                          0.03, 0.05, 0.06, 0.04, 0.05, 0.07, 0.05, 0.04, 0.06, 0.05,
                          0.04, 0.05, 0.03]
}

Expected Decision Walkthrough

| Gate | Value | Threshold | Result |
| --- | --- | --- | --- |
| Risk | delta = (0.12-0.05)/0.05 = 1.4 | P(delta >= 2.0) < 0.95 | PASS — increased but below 2x |
| Profit | delta = (0.85-0.7)/0.7 = 0.21 | Performance improved | PASS |
| Novelty | G(0.25) ≈ 0.01 | 0.01 < 0.8 | FAIL — insufficient novelty (familiar pattern scores low) |
| Complexity | 0.55 >= 0.5 | Floor check | PASS — barely above floor |
| Quality | 0.8 >= 0.7, no zeros | Min score + subscore check | PASS |
| Drift | KL against baseline | KL < 0.3 expected | PASS |

Expected status: ESCALATE (high impact + novelty gate failure — but in shadow mode, advisory only)

The novelty gate fails because G(0.25) ≈ 0.01, far below the 0.8 threshold: the agent's high confidence (0.75) maps to low novelty (0.25), which the gate reads as insufficient. With estimated_impact=high and a failing gate, the decision escalates. Because shadow_mode=true, the decision is advisory only. The response includes a shadow_result object with:

- shadow_only: true
- observation_values: current metric values for future drift comparison
- baseline_hash: SHA-256 of the baseline data (tamper detection)
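The drift gate compares recent observations against drift_baseline_data with a KL-divergence statistic. The exact statistic is internal to AEGIS; the sketch below assumes a simple equal-width histogram discretization and uses a slice of this template's baseline data:

```python
import math
from collections import Counter

def kl_divergence(baseline, current, bins=5, eps=1e-9):
    """Discrete KL(current || baseline) over equal-width bins, with eps smoothing."""
    lo, hi = min(baseline + current), max(baseline + current)
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        counts = Counter(min(int((x - lo) / width), bins - 1) for x in xs)
        return [(counts.get(i, 0) + eps) / (len(xs) + bins * eps) for i in range(bins)]
    p, q = hist(current), hist(baseline)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

baseline = [0.04, 0.06, 0.05, 0.03, 0.07, 0.05, 0.04, 0.06, 0.05, 0.04]  # slice of drift_baseline_data
current = [0.05, 0.04, 0.06, 0.05, 0.03]                                 # recent failure rates
drift = kl_divergence(baseline, current)   # well under the 0.3 expectation -> drift gate passes
```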

Calibration insight: The ESCALATE result reveals that mapping novelty_score = 1 - confidence may be too aggressive for routine refactoring tasks. If the agent is confident, that should not trigger escalation. Consider mapping novelty differently — e.g., only use high novelty scores for genuinely novel actions (new language, unfamiliar codebase, first-time architecture change).

Calibration workflow:

1. Run in shadow mode for 30+ days, collecting decisions.
2. Review shadow results — are the gates producing sensible outcomes?
3. Adjust estimated_impact or metric derivations if gates are too strict or too lenient (e.g., this example suggests refining the novelty mapping).
4. Switch shadow_mode to false to enforce decisions.

Note on complexity_score: At 0.55, this is barely above the 0.5 floor. If the refactor scope grows to 20+ files, recalculate — dropping below 0.5 would cause an unoverridable HALT.
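The note's recalculation, using the mapping from the parameter table (the divisor 100 is taken from the 1 - (15 * 3 / 100) derivation and assumed fixed):

```python
def complexity_score(files, modules, max_product=100):
    """1 - (files * modules / max), per this template's parameter mapping."""
    return 1.0 - (files * modules) / max_product

current = complexity_score(15, 3)   # ~0.55, barely above the 0.5 floor
grown = complexity_score(20, 3)     # ~0.40, below the floor: unoverridable HALT
```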


Common Integration Patterns

Pattern: Gradual Rollout

Start with shadow mode, then enforce on low-risk changes first:

Phase 1 (Week 1-4):   shadow_mode=true,  all proposals
Phase 2 (Week 5-8):   shadow_mode=false, estimated_impact=low only
Phase 3 (Week 9-12):  shadow_mode=false, low + medium
Phase 4 (Week 13+):   shadow_mode=false, all proposals
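The schedule above can be encoded as a client-side guard that decides the shadow_mode flag per proposal. The phase-to-impact mapping mirrors the table; function and variable names are illustrative:

```python
ENFORCED_IMPACTS = {
    1: set(),                                  # shadow only
    2: {"low"},
    3: {"low", "medium"},
    4: {"low", "medium", "high", "critical"},  # full enforcement
}

def shadow_mode_for(phase, estimated_impact):
    """True -> submit with shadow_mode=true (advisory); False -> enforce the decision."""
    return estimated_impact not in ENFORCED_IMPACTS[phase]

advisory = shadow_mode_for(2, "medium")   # True: medium impact is still advisory in Phase 2
```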

Pattern: Pre-flight Check

Call aegis_check_thresholds before submitting a proposal to understand what gate values will be evaluated:

{"method": "tools/call", "params": {"name": "aegis_check_thresholds", "arguments": {}}}

Pattern: Quick Risk Guard

For simple actions where only risk matters, use the simplified API:

{"method": "tools/call", "params": {"name": "aegis_quick_risk_check", "arguments": {"action_description": "Delete staging database", "risk_score": 0.8}}}

This returns safe: false (0.8 >= 0.5 threshold) without full gate evaluation. Use aegis_evaluate_proposal for actual governance decisions.
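Client-side, the quick check reduces to a single threshold comparison (a sketch of the documented behavior, where safe means risk_score strictly below 0.5):

```python
def quick_risk_safe(risk_score, threshold=0.5):
    """Safe iff the risk score is below the threshold, per the documented example."""
    return risk_score < threshold

unsafe = quick_risk_safe(0.8)   # False, matching the tool's safe: false response
```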

