
EPCC Implementation Plan: GAP-PerfTest

**Issue**: #2 - GAP-PerfTest: Load-Test Guardrail Service (<500 ms p95)
**Status**: Planning
**Owner**: DevOps
**Created**: 2025-12-26
**Target Completion**: Before β-to-Prod milestone (2026-03-31)


Phase 1: Research & Validation

1.1 Research Findings

| Source | Key Finding | Applied To |
|--------|-------------|------------|
| AWS SageMaker Load Testing Best Practices | Benchmark a single instance first, then extrapolate; use CloudWatch metrics for TPS and latency | Testing methodology |
| Performance Testing for AI Applications | Latency validation, scalability checks, resource efficiency, throughput benchmarking | Test objectives |
| AI Inference Optimization | Batch processing, model optimization, GPU utilization monitoring | Performance tuning |
| LLM Load Balancing | Track token generation state, adapt to varying workloads, automate health checks | Architecture patterns |

1.2 Industry Standards

ML Inference Latency Benchmarks:

| Use Case | Typical p95 Target | Typical p99 Target |
|----------|--------------------|--------------------|
| Real-time trading | < 10 ms | < 50 ms |
| Fraud detection | < 100 ms | < 200 ms |
| Recommendation systems | < 200 ms | < 500 ms |
| Risk scoring (our case) | < 500 ms | < 1000 ms |

The current target of p95 < 500 ms is in line with industry norms for risk-scoring workloads.

1.3 Load Testing Approaches (Evaluated)

| Approach | Pros | Cons | Recommendation |
|----------|------|------|----------------|
| Locust | Python-native, scriptable, distributed | Requires custom setup | Use for flexibility |
| k6 | Modern, cloud-native, good metrics | JavaScript-based | Alternative option |
| Artillery | YAML config, easy setup | Less flexible | Quick validation |
| AWS Load Testing | Native integration | AWS-specific | Use for production validation |

Selected Approach: Locust for development, AWS Distributed Load Testing for production validation


Phase 2: Architecture Overview

2.1 High-Level Design

```
┌─────────────────────────────────────────────────────────────────┐
│                    LOAD TESTING ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │   Load Gen   │    │   Guardrail  │    │   Metrics    │       │
│  │   (Locust)   │───▶│   Service    │───▶│  Collector   │       │
│  │              │    │              │    │ (CloudWatch) │       │
│  └──────────────┘    └──────────────┘    └──────┬───────┘       │
│         │                                        │               │
│         │            ┌───────────────────────────┘               │
│         ▼            ▼                                           │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    METRICS DASHBOARD                     │    │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐     │    │
│  │  │ p50 lat │  │ p95 lat │  │ p99 lat │  │   TPS   │     │    │
│  │  │  <100ms │  │  <500ms │  │ <1000ms │  │  >100/s │     │    │
│  │  └─────────┘  └─────────┘  └─────────┘  └─────────┘     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

2.2 Performance Targets

| Metric | Target | Measurement | Alert Threshold |
|--------|--------|-------------|-----------------|
| p50 latency | < 100 ms | CloudWatch | > 150 ms |
| p95 latency | < 500 ms | CloudWatch | > 500 ms |
| p99 latency | < 1000 ms | CloudWatch | > 1000 ms |
| Throughput | > 100 req/s | CloudWatch | < 80 req/s |
| Error rate | < 0.1% | CloudWatch | > 0.5% |
| CPU utilization | < 70% | CloudWatch | > 80% |
| Memory utilization | < 80% | CloudWatch | > 85% |
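The latency rows of this table can be enforced as CloudWatch alarms. A minimal sketch for the p95 gate, assuming the service publishes a custom `Guardrail/Latency` metric with a `Service=Evaluator` dimension (mirroring the dashboard in section 4.2); the alarm name and three-minute evaluation window are illustrative choices, not settled configuration:

```python
def p95_alarm_params(threshold_ms: float = 500.0) -> dict:
    """Build PutMetricAlarm parameters for the p95 latency threshold.

    Namespace and dimension names are assumptions that mirror the
    section 4.2 dashboard; adjust to match the deployed metric.
    """
    return {
        "AlarmName": "guardrail-p95-latency",
        "Namespace": "Guardrail",
        "MetricName": "Latency",
        "Dimensions": [{"Name": "Service", "Value": "Evaluator"}],
        "ExtendedStatistic": "p95",   # percentile stats use ExtendedStatistic, not Statistic
        "Period": 60,
        "EvaluationPeriods": 3,       # three consecutive 1-minute breaches before alerting
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# To create the alarm:
#   boto3.client("cloudwatch").put_metric_alarm(**p95_alarm_params())
```

The same builder can be reused for the p50/p99 rows by swapping the statistic and threshold.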

2.3 ADR: Load Testing Strategy

ADR-002: Performance Testing Strategy

Status: Proposed

Context: Need to validate guardrail service meets p95 < 500 ms under production load.

Decision Drivers:
- Must simulate realistic production traffic patterns
- Need to identify bottlenecks before production
- Require repeatable, automated tests

Decision: Use a staged load-testing approach:
1. Baseline: Single instance, low load (10 RPS)
2. Scale: Increase to production estimate (100+ RPS)
3. Stress: Push beyond expected peak (2x production)
4. Soak: Extended duration (4+ hours) at production load

Consequences:
- Comprehensive performance profile
- Early bottleneck identification
- Requires dedicated test environment
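The four stages in ADR-002 can be expressed as one continuous load schedule. A sketch of the scheduling logic, assuming roughly 1 RPS per simulated user (tune the counts against the Milestone 2 baseline); in Locust this logic would live in a `LoadTestShape.tick()` method:

```python
# Staged schedule from ADR-002: baseline -> scale -> stress -> soak.
# Tuples are (end_time_seconds, user_count, spawn_rate).
# User counts assume ~1 RPS per user -- an assumption to validate.
STAGES = [
    (600, 10, 1),       # baseline: ~10 RPS for 10 minutes
    (2400, 100, 10),    # scale: ramp to the production estimate
    (4200, 200, 10),    # stress: 2x expected peak for 30 minutes
    (18600, 100, 10),   # soak: back at production load for 4 hours
]

def tick(run_time_s: float):
    """Return (users, spawn_rate) for the current time, or None to stop."""
    for end, users, rate in STAGES:
        if run_time_s < end:
            return users, rate
    return None  # schedule exhausted: end the test
```

Running all four stages back-to-back trades isolation for convenience; the milestone plan below still runs them separately so each produces its own report.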


Phase 3: Implementation Strategy

3.1 Prerequisites

| Prerequisite | Status | Owner | Notes |
|--------------|--------|-------|-------|
| Shadow scoring service deployed | Required | DevOps | Target environment |
| Test environment isolated | Required | DevOps | Avoid production impact |
| CloudWatch metrics configured | Required | DevOps | Latency histograms |
| Sample payload corpus | Required | Data Eng | Representative proposals |
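Until Data Eng delivers a corpus sampled from real proposals, a synthetic stand-in can be generated by jittering a seed payload. A sketch only; the field names follow the Locust script in section 4.1, and the ±0.2 jitter range is arbitrary:

```python
import random

# Seed proposal; field names follow the Locust script in section 4.1.
SEED = {
    "risk_base": 0.3,
    "profit_base": 100000,
    "risk_prop": 0.35,
    "profit_prop": 120000,
    "novelty_score": 0.6,
    "complexity_score": 0.7,
    "quality_score": 0.8,
}

# Fields that must stay within [0, 1] after jittering.
BOUNDED_FIELDS = ("risk_base", "risk_prop", "novelty_score",
                  "complexity_score", "quality_score")

def make_corpus(n=100, seed=42):
    """Generate n jittered variants of SEED (deterministic for a given seed)."""
    rng = random.Random(seed)
    corpus = []
    for i in range(n):
        p = dict(SEED)
        p["proposal_id"] = f"corpus-{i:04d}"
        for key in BOUNDED_FIELDS:
            # +/-0.2 uniform jitter, clamped to [0, 1]
            p[key] = round(min(1.0, max(0.0, p[key] + rng.uniform(-0.2, 0.2))), 3)
        corpus.append(p)
    return corpus

# To write the corpus for the Locust script:
#   json.dump(make_corpus(), open("payload_corpus.json", "w"), indent=2)
```

Synthetic payloads validate the pipeline plumbing but not realism; swap in sampled proposals before drawing conclusions from latency numbers (see the corpus-representativeness risk in section 3.3).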

3.2 Milestones

Milestone 1: Test Infrastructure Setup (Week 1)

  • [ ] 1.1 Deploy Locust cluster in test environment
  • [ ] 1.2 Configure CloudWatch dashboard for latency metrics
  • [ ] 1.3 Create sample payload corpus (100+ representative proposals)
  • [ ] 1.4 Document baseline instance configuration

Exit Criteria: Locust can send requests to guardrail service and metrics appear in CloudWatch

Milestone 2: Baseline Performance (Week 2)

  • [ ] 2.1 Run baseline test: 10 RPS for 10 minutes
  • [ ] 2.2 Record baseline metrics:
    p50 latency:    ___ ms
    p95 latency:    ___ ms
    p99 latency:    ___ ms
    Max latency:    ___ ms
    Error rate:     ___%
    CPU util:       ___%
    Memory util:    ___%
    
  • [ ] 2.3 Identify cold start behavior
  • [ ] 2.4 Document single-instance capacity

Exit Criteria: Baseline metrics documented, no errors at 10 RPS
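The blanks in item 2.2 can be filled directly from Locust's `--csv` output. A sketch of parsing `baseline_stats.csv`; the column names match Locust 2.x and should be verified against the header row of your own file before use:

```python
import csv

def baseline_metrics(path="baseline_stats.csv"):
    """Read the aggregated row of a Locust stats CSV into a metrics dict.

    Column names ("50%", "95%", etc.) are those emitted by Locust 2.x;
    check your CSV header if running a different version.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    agg = next(r for r in rows if r["Name"] == "Aggregated")  # summary row
    requests = int(agg["Request Count"])
    failures = int(agg["Failure Count"])
    return {
        "p50_ms": float(agg["50%"]),
        "p95_ms": float(agg["95%"]),
        "p99_ms": float(agg["99%"]),
        "max_ms": float(agg["Max Response Time"]),
        "error_rate_pct": 100.0 * failures / requests if requests else 0.0,
    }
```

CPU and memory utilization are not in the Locust CSV; pull those from CloudWatch for the same time window.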

Milestone 3: Scale Testing (Week 3)

  • [ ] 3.1 Ramp test: 10 → 50 → 100 → 150 RPS over 30 minutes
  • [ ] 3.2 Identify inflection points where latency degrades
  • [ ] 3.3 Record metrics at each load level
  • [ ] 3.4 Calculate required instance count for production

Exit Criteria: Understand scaling characteristics, target TPS achievable
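Item 3.4 reduces to simple arithmetic once single-instance capacity is measured. A sketch; the 70% headroom factor mirrors the CPU-utilization target in section 2.2, and the example capacity figure is hypothetical:

```python
import math

def required_instances(target_rps, per_instance_rps, headroom=0.7):
    """Instances needed to serve target_rps with each kept at <= headroom load.

    headroom=0.7 keeps steady-state utilization under the 70% CPU
    target from section 2.2; per_instance_rps comes from Milestone 2.
    """
    return math.ceil(target_rps / (per_instance_rps * headroom))

# e.g. 100 RPS target with a measured 60 RPS per instance -> 3 instances
```

This ignores redundancy; production should add at least one instance beyond the computed count so an instance failure does not immediately breach the latency target.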

Milestone 4: Stress & Soak Testing (Week 4)

  • [ ] 4.1 Stress test: 2x expected peak load for 30 minutes
  • [ ] 4.2 Soak test: Production load for 4+ hours
  • [ ] 4.3 Monitor for memory leaks, connection exhaustion
  • [ ] 4.4 Document failure modes and recovery behavior

Exit Criteria: No degradation over extended periods, graceful failure under stress

Milestone 5: Optimization & Report (Week 5)

  • [ ] 5.1 Address identified bottlenecks:
  • [ ] Add caching if model loading is slow
  • [ ] Optimize computation if CPU-bound
  • [ ] Add connection pooling if I/O-bound
  • [ ] 5.2 Re-run tests to validate improvements
  • [ ] 5.3 Create performance test report
  • [ ] 5.4 Establish performance regression tests for CI/CD

Exit Criteria: p95 < 500 ms validated, performance report published

3.3 Risk Register

| Risk | Likelihood | Impact | Rating (L×I) | Mitigation |
|------|------------|--------|--------------|------------|
| Test environment differs from prod | M | H | 6 | Use identical instance types |
| Payload corpus not representative | M | M | 4 | Sample from real proposals |
| Network latency skews results | L | M | 2 | Run Locust in same VPC |
| Cold start affects measurements | M | L | 2 | Warm up before measuring |

Phase 4: Technical Excellence

4.1 Locust Test Script (Python)

"""
Guardrail Service Load Test
Issue #2: GAP-PerfTest
"""

from locust import HttpUser, task, between
import json
import random

# Sample payloads representing realistic proposals
SAMPLE_PAYLOADS = [
    {
        "proposal_id": "test-001",
        "risk_base": 0.3,
        "profit_base": 100000,
        "risk_prop": 0.35,
        "profit_prop": 120000,
        "novelty_score": 0.6,
        "complexity_score": 0.7,
        "quality_score": 0.8
    },
    # Add more representative payloads...
]


class GuardrailUser(HttpUser):
    """Simulates user submitting proposals for guardrail evaluation."""

    wait_time = between(0.1, 0.5)  # 2-10 requests per second per user

    @task(10)
    def evaluate_proposal(self):
        """Primary task: evaluate a proposal."""
        payload = random.choice(SAMPLE_PAYLOADS)
        payload["proposal_id"] = f"load-test-{random.randint(1, 1000000)}"

        with self.client.post(
            "/evaluate",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                result = response.json()
                if "guardrail_decision" in result:
                    response.success()
                else:
                    response.failure("Missing guardrail_decision in response")
            else:
                response.failure(f"Status code: {response.status_code}")

    @task(1)
    def health_check(self):
        """Occasional health check."""
        self.client.get("/health")


class StressUser(HttpUser):
    """High-frequency user for stress testing."""

    wait_time = between(0.01, 0.05)  # 20-100 requests per second per user

    @task
    def rapid_evaluate(self):
        """Rapid-fire evaluations."""
        payload = random.choice(SAMPLE_PAYLOADS)
        self.client.post("/evaluate", json=payload)

4.2 CloudWatch Dashboard Configuration

```json
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "Response Time Percentiles",
        "metrics": [
          ["Guardrail", "Latency", "Service", "Evaluator", { "stat": "p50", "label": "p50" }],
          ["Guardrail", "Latency", "Service", "Evaluator", { "stat": "p95", "label": "p95" }],
          ["Guardrail", "Latency", "Service", "Evaluator", { "stat": "p99", "label": "p99" }]
        ],
        "period": 60,
        "annotations": {
          "horizontal": [
            { "value": 500, "label": "p95 Target", "color": "#ff0000" }
          ]
        }
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Throughput (TPS)",
        "metrics": [
          ["Guardrail", "RequestCount", "Service", "Evaluator"]
        ],
        "period": 60,
        "stat": "Sum"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Error Rate",
        "metrics": [
          ["Guardrail", "ErrorCount", "Service", "Evaluator"]
        ],
        "period": 60,
        "stat": "Sum"
      }
    }
  ]
}
```
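The dashboard body above is pushed with the CloudWatch `PutDashboard` API, which takes the widget definition as a JSON string rather than a nested object. A minimal sketch; the dashboard name is an assumption:

```python
import json

def dashboard_request(body, name="guardrail-load-test"):
    """Build PutDashboard parameters from a widget-definition dict.

    PutDashboard requires DashboardBody as a JSON *string*, hence the
    json.dumps; the dashboard name here is a placeholder.
    """
    return {"DashboardName": name, "DashboardBody": json.dumps(body)}

# To publish the dashboard saved as dashboard.json:
#   with open("dashboard.json") as f:
#       boto3.client("cloudwatch").put_dashboard(**dashboard_request(json.load(f)))
```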

4.3 Performance Report Template

```markdown
# Guardrail Service Performance Report

**Test Date**: 2026-01-15 (example; replace with actual test date)
**Environment**: [Staging/Production]
**Service Version**: [version]

## Summary

| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| p95 Latency | < 500 ms | ___ ms | ✅/❌ |
| p99 Latency | < 1000 ms | ___ ms | ✅/❌ |
| Max Throughput | > 100 RPS | ___ RPS | ✅/❌ |
| Error Rate | < 0.1% | ___% | ✅/❌ |

## Test Scenarios

### Baseline (10 RPS)
- Duration: 10 minutes
- Results: [table of metrics]

### Scale (100 RPS)
- Duration: 30 minutes
- Results: [table of metrics]

### Stress (200 RPS)
- Duration: 30 minutes
- Results: [table of metrics]

### Soak (100 RPS, 4 hours)
- Duration: 4 hours
- Results: [table of metrics]

## Bottlenecks Identified

1. [Description of bottleneck]
   - Impact: [latency impact]
   - Resolution: [fix applied]

## Recommendations

1. [Optimization recommendation]
2. [Scaling recommendation]

## Appendix: Raw Data

[Link to detailed metrics export]
```

Phase 5: Development Workflow

5.1 Test Execution Commands

```bash
# Start Locust with web UI
locust -f load_test.py --host=https://aegis-staging.acme-corp.test

# Headless baseline test
locust -f load_test.py \
  --host=https://aegis-staging.acme-corp.test \
  --headless \
  --users 10 \
  --spawn-rate 1 \
  --run-time 10m \
  --csv=baseline

# Headless scale test
locust -f load_test.py \
  --host=https://aegis-staging.acme-corp.test \
  --headless \
  --users 100 \
  --spawn-rate 10 \
  --run-time 30m \
  --csv=scale
```

5.2 Testing Strategy

| Test Type | Description | Pass Criteria |
|-----------|-------------|---------------|
| Baseline | Low load, single instance | p95 < 200 ms |
| Scale | Production load estimate | p95 < 500 ms |
| Stress | 2x production peak | Graceful degradation |
| Soak | Extended duration | No memory leaks |
| Spike | Sudden load increase | Recovery < 30 s |

5.3 CI/CD Integration

```yaml
# GitHub Actions workflow for performance regression
name: Performance Regression Test

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1'  # Weekly Monday 2 AM

jobs:
  perf-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Performance Test
        run: |
          pip install locust
          locust -f tests/perf/load_test.py \
            --host=${{ secrets.STAGING_URL }} \
            --headless \
            --users 50 \
            --spawn-rate 5 \
            --run-time 5m \
            --csv=results

      - name: Check p95 Latency
        run: |
          # The "95%" column in the Locust 2.x stats CSV; verify the field
          # index against the header row if your Locust version differs.
          P95=$(tail -1 results_stats.csv | cut -d',' -f17)
          # awk handles the comparison so fractional values don't break [ -gt ]
          if awk "BEGIN { exit !($P95 > 500) }"; then
            echo "p95 latency ${P95} ms exceeds 500 ms target"
            exit 1
          fi
```

Phase 6: Success Criteria & Metrics

6.1 Definition of Done

  • [ ] p95 latency < 500 ms validated under production load
  • [ ] Performance test report published
  • [ ] Bottlenecks identified and addressed
  • [ ] Performance regression tests in CI/CD
  • [ ] CloudWatch dashboard configured
  • [ ] Runbook for performance troubleshooting created

6.2 Success Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| p95 latency | < 500 ms | Load test results |
| Throughput capacity | > 100 RPS | Load test results |
| Error rate under load | < 0.1% | Load test results |
| Performance regression | < 10% | CI/CD comparison |

6.3 Quality Gates

| Gate | Requirement | Verification |
|------|-------------|--------------|
| Baseline complete | p95 documented | Test report |
| Scale test pass | p95 < 500 ms at 100 RPS | Test report |
| Stress test pass | Graceful degradation | Test report |
| Soak test pass | No leaks over 4 hours | Memory monitoring |
| CI/CD integration | Automated regression | Pipeline pass |

Appendix: Research Sources

  1. AWS Machine Learning Blog. "Best practices for load testing Amazon SageMaker real-time inference endpoints." https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/

  2. Testriq. "Performance Testing for AI Applications: Speed, Scalability & Reliability at Scale." https://www.testriq.com/blog/post/performance-testing-for-ai-applications

  3. RunPod. "AI Inference Optimization: Achieving Maximum Throughput with Minimal Latency." https://www.runpod.io/articles/guides/ai-inference-optimization-achieving-maximum-throughput-with-minimal-latency

  4. TrueFoundry. "LLM Load Balancing." https://www.truefoundry.com/blog/llm-load-balancing


Changelog

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2025-12-26 | Claude Code | Initial EPCC plan |