
EPCC Implementation Plan: GAP-PerfTest

**Issue**: #2 - GAP-PerfTest: Load-Test Guardrail Service (<500 ms p95)
**Status**: Planning
**Owner**: DevOps
**Created**: 2025-12-26
**Target Completion**: Before β-to-Prod milestone (2026-03-31)


Phase 1: Research & Validation

1.1 Research Findings

| Source | Key Finding | Applied To |
|--------|-------------|------------|
| AWS SageMaker Load Testing Best Practices | Benchmark a single instance first, then extrapolate; use CloudWatch metrics for TPS and latency | Testing methodology |
| Performance Testing for AI Applications | Latency validation, scalability checks, resource efficiency, throughput benchmarking | Test objectives |
| AI Inference Optimization | Batch processing, model optimization, GPU utilization monitoring | Performance tuning |
| LLM Load Balancing | Track token generation state, adapt to varying workloads, automate health checks | Architecture patterns |

1.2 Industry Standards

ML Inference Latency Benchmarks:

| Use Case | Typical p95 Target | Typical p99 Target |
|----------|--------------------|--------------------|
| Real-time trading | < 10 ms | < 50 ms |
| Fraud detection | < 100 ms | < 200 ms |
| Recommendation systems | < 200 ms | < 500 ms |
| Risk scoring (our case) | < 500 ms | < 1000 ms |

The current target of p95 < 500 ms is in line with industry norms for risk-scoring workloads.

1.3 Load Testing Approaches (Evaluated)

| Approach | Pros | Cons | Recommendation |
|----------|------|------|----------------|
| Locust | Python-native, scriptable, distributed | Requires custom setup | Use for flexibility |
| k6 | Modern, cloud-native, good metrics | JavaScript-based | Alternative option |
| Artillery | YAML config, easy setup | Less flexible | Quick validation |
| AWS Load Testing | Native integration | AWS-specific | Use for production validation |

Selected Approach: Locust for development, AWS Distributed Load Testing for production validation


Phase 2: Architecture Overview

2.1 High-Level Design

```
┌─────────────────────────────────────────────────────────────────┐
│                    LOAD TESTING ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │   Load Gen   │    │   Guardrail  │    │   Metrics    │       │
│  │   (Locust)   │───▶│   Service    │───▶│  Collector   │       │
│  │              │    │              │    │ (CloudWatch) │       │
│  └──────────────┘    └──────────────┘    └──────┬───────┘       │
│         │                                        │               │
│         │            ┌───────────────────────────┘               │
│         ▼            ▼                                           │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    METRICS DASHBOARD                     │    │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐     │    │
│  │  │ p50 lat │  │ p95 lat │  │ p99 lat │  │   TPS   │     │    │
│  │  │  <100ms │  │  <500ms │  │ <1000ms │  │  >100/s │     │    │
│  │  └─────────┘  └─────────┘  └─────────┘  └─────────┘     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

2.2 Performance Targets

| Metric | Target | Measurement | Alert Threshold |
|--------|--------|-------------|-----------------|
| p50 latency | < 100 ms | CloudWatch | > 150 ms |
| p95 latency | < 500 ms | CloudWatch | > 500 ms |
| p99 latency | < 1000 ms | CloudWatch | > 1000 ms |
| Throughput | > 100 req/s | CloudWatch | < 80 req/s |
| Error rate | < 0.1% | CloudWatch | > 0.5% |
| CPU utilization | < 70% | CloudWatch | > 80% |
| Memory utilization | < 80% | CloudWatch | > 85% |
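The latency rows of this table can be enforced as CloudWatch alarms. A minimal sketch for the p95 gate, assuming the service publishes a custom `Guardrail/Latency` metric with a `Service=Evaluator` dimension (mirroring the dashboard in section 4.2); the alarm name and three-minute evaluation window are illustrative choices, not settled configuration:

```python
def p95_alarm_params(threshold_ms: float = 500.0) -> dict:
    """Build PutMetricAlarm parameters for the p95 latency threshold.

    Namespace and dimension names are assumptions that mirror the
    section 4.2 dashboard; adjust to match the deployed metric.
    """
    return {
        "AlarmName": "guardrail-p95-latency",
        "Namespace": "Guardrail",
        "MetricName": "Latency",
        "Dimensions": [{"Name": "Service", "Value": "Evaluator"}],
        "ExtendedStatistic": "p95",   # percentile stats use ExtendedStatistic, not Statistic
        "Period": 60,
        "EvaluationPeriods": 3,       # three consecutive 1-minute breaches before alerting
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# To create the alarm:
#   boto3.client("cloudwatch").put_metric_alarm(**p95_alarm_params())
```

The same builder can be reused for the p50/p99 rows by swapping the statistic and threshold.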

2.3 ADR: Load Testing Strategy

ADR-002: Performance Testing Strategy

Status: Proposed

Context: Need to validate guardrail service meets p95 < 500 ms under production load.

Decision Drivers:
- Must simulate realistic production traffic patterns
- Need to identify bottlenecks before production
- Require repeatable, automated tests

Decision: Use a staged load-testing approach:
1. Baseline: Single instance, low load (10 RPS)
2. Scale: Increase to production estimate (100+ RPS)
3. Stress: Push beyond expected peak (2x production)
4. Soak: Extended duration (4+ hours) at production load

Consequences:
- Comprehensive performance profile
- Early bottleneck identification
- Requires dedicated test environment
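The four stages in ADR-002 can be expressed as one continuous load schedule. A sketch of the scheduling logic, assuming roughly 1 RPS per simulated user (tune the counts against the Milestone 2 baseline); in Locust this logic would live in a `LoadTestShape.tick()` method:

```python
# Staged schedule from ADR-002: baseline -> scale -> stress -> soak.
# Tuples are (end_time_seconds, user_count, spawn_rate).
# User counts assume ~1 RPS per user -- an assumption to validate.
STAGES = [
    (600, 10, 1),       # baseline: ~10 RPS for 10 minutes
    (2400, 100, 10),    # scale: ramp to the production estimate
    (4200, 200, 10),    # stress: 2x expected peak for 30 minutes
    (18600, 100, 10),   # soak: back at production load for 4 hours
]

def tick(run_time_s: float):
    """Return (users, spawn_rate) for the current time, or None to stop."""
    for end, users, rate in STAGES:
        if run_time_s < end:
            return users, rate
    return None  # schedule exhausted: end the test
```

Running all four stages back-to-back trades isolation for convenience; the milestone plan below still runs them separately so each produces its own report.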


Phase 3: Implementation Strategy

3.1 Prerequisites

| Prerequisite | Status | Owner | Notes |
|--------------|--------|-------|-------|
| Shadow scoring service deployed | Required | DevOps | Target environment |
| Test environment isolated | Required | DevOps | Avoid production impact |
| CloudWatch metrics configured | Required | DevOps | Latency histograms |
| Sample payload corpus | Required | Data Eng | Representative proposals |
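Until Data Eng delivers a corpus sampled from real proposals, a synthetic stand-in can be generated by jittering a seed payload. A sketch only; the field names follow the Locust script in section 4.1, and the ±0.2 jitter range is arbitrary:

```python
import random

# Seed proposal; field names follow the Locust script in section 4.1.
SEED = {
    "risk_base": 0.3,
    "profit_base": 100000,
    "risk_prop": 0.35,
    "profit_prop": 120000,
    "novelty_score": 0.6,
    "complexity_score": 0.7,
    "quality_score": 0.8,
}

# Fields that must stay within [0, 1] after jittering.
BOUNDED_FIELDS = ("risk_base", "risk_prop", "novelty_score",
                  "complexity_score", "quality_score")

def make_corpus(n=100, seed=42):
    """Generate n jittered variants of SEED (deterministic for a given seed)."""
    rng = random.Random(seed)
    corpus = []
    for i in range(n):
        p = dict(SEED)
        p["proposal_id"] = f"corpus-{i:04d}"
        for key in BOUNDED_FIELDS:
            # +/-0.2 uniform jitter, clamped to [0, 1]
            p[key] = round(min(1.0, max(0.0, p[key] + rng.uniform(-0.2, 0.2))), 3)
        corpus.append(p)
    return corpus

# To write the corpus for the Locust script:
#   json.dump(make_corpus(), open("payload_corpus.json", "w"), indent=2)
```

Synthetic payloads validate the pipeline plumbing but not realism; swap in sampled proposals before drawing conclusions from latency numbers (see the corpus-representativeness risk in section 3.3).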

3.2 Milestones

Milestone 1: Test Infrastructure Setup (Week 1)

  • [ ] 1.1 Deploy Locust cluster in test environment
  • [ ] 1.2 Configure CloudWatch dashboard for latency metrics
  • [ ] 1.3 Create sample payload corpus (100+ representative proposals)
  • [ ] 1.4 Document baseline instance configuration

Exit Criteria: Locust can send requests to guardrail service and metrics appear in CloudWatch

Milestone 2: Baseline Performance (Week 2)

  • [ ] 2.1 Run baseline test: 10 RPS for 10 minutes
  • [ ] 2.2 Record baseline metrics:
    p50 latency:    ___ ms
    p95 latency:    ___ ms
    p99 latency:    ___ ms
    Max latency:    ___ ms
    Error rate:     ___%
    CPU util:       ___%
    Memory util:    ___%
    
  • [ ] 2.3 Identify cold start behavior
  • [ ] 2.4 Document single-instance capacity

Exit Criteria: Baseline metrics documented, no errors at 10 RPS
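The blanks in item 2.2 can be filled directly from Locust's `--csv` output. A sketch of parsing `baseline_stats.csv`; the column names match Locust 2.x and should be verified against the header row of your own file before use:

```python
import csv

def baseline_metrics(path="baseline_stats.csv"):
    """Read the aggregated row of a Locust stats CSV into a metrics dict.

    Column names ("50%", "95%", etc.) are those emitted by Locust 2.x;
    check your CSV header if running a different version.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    agg = next(r for r in rows if r["Name"] == "Aggregated")  # summary row
    requests = int(agg["Request Count"])
    failures = int(agg["Failure Count"])
    return {
        "p50_ms": float(agg["50%"]),
        "p95_ms": float(agg["95%"]),
        "p99_ms": float(agg["99%"]),
        "max_ms": float(agg["Max Response Time"]),
        "error_rate_pct": 100.0 * failures / requests if requests else 0.0,
    }
```

CPU and memory utilization are not in the Locust CSV; pull those from CloudWatch for the same time window.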

Milestone 3: Scale Testing (Week 3)

  • [ ] 3.1 Ramp test: 10 → 50 → 100 → 150 RPS over 30 minutes
  • [ ] 3.2 Identify inflection points where latency degrades
  • [ ] 3.3 Record metrics at each load level
  • [ ] 3.4 Calculate required instance count for production

Exit Criteria: Understand scaling characteristics, target TPS achievable
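Item 3.4 reduces to simple arithmetic once single-instance capacity is measured. A sketch; the 70% headroom factor mirrors the CPU-utilization target in section 2.2, and the example capacity figure is hypothetical:

```python
import math

def required_instances(target_rps, per_instance_rps, headroom=0.7):
    """Instances needed to serve target_rps with each kept at <= headroom load.

    headroom=0.7 keeps steady-state utilization under the 70% CPU
    target from section 2.2; per_instance_rps comes from Milestone 2.
    """
    return math.ceil(target_rps / (per_instance_rps * headroom))

# e.g. 100 RPS target with a measured 60 RPS per instance -> 3 instances
```

This ignores redundancy; production should add at least one instance beyond the computed count so an instance failure does not immediately breach the latency target.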

Milestone 4: Stress & Soak Testing (Week 4)

  • [ ] 4.1 Stress test: 2x expected peak load for 30 minutes
  • [ ] 4.2 Soak test: Production load for 4+ hours
  • [ ] 4.3 Monitor for memory leaks, connection exhaustion
  • [ ] 4.4 Document failure modes and recovery behavior

Exit Criteria: No degradation over extended periods, graceful failure under stress

Milestone 5: Optimization & Report (Week 5)

  • [ ] 5.1 Address identified bottlenecks:
  • [ ] Add caching if model loading is slow
  • [ ] Optimize computation if CPU-bound
  • [ ] Add connection pooling if I/O-bound
  • [ ] 5.2 Re-run tests to validate improvements
  • [ ] 5.3 Create performance test report
  • [ ] 5.4 Establish performance regression tests for CI/CD

Exit Criteria: p95 < 500 ms validated, performance report published

3.3 Risk Register

| Risk | Likelihood | Impact | Rating (L×I) | Mitigation |
|------|------------|--------|--------------|------------|
| Test environment differs from prod | M | H | 6 | Use identical instance types |
| Payload corpus not representative | M | M | 4 | Sample from real proposals |
| Network latency skews results | L | M | 2 | Run Locust in same VPC |
| Cold start affects measurements | M | L | 2 | Warm up before measuring |

Phase 4: Technical Excellence

4.1 Locust Test Script (Python)

"""
Guardrail Service Load Test
Issue #2: GAP-PerfTest
"""

from locust import HttpUser, task, between
import json
import random

# Sample payloads representing realistic proposals
SAMPLE_PAYLOADS = [
    {
        "proposal_id": "test-001",
        "risk_base": 0.3,
        "profit_base": 100000,
        "risk_prop": 0.35,
        "profit_prop": 120000,
        "novelty_score": 0.6,
        "complexity_score": 0.7,
        "quality_score": 0.8
    },
    # Add more representative payloads...
]


class GuardrailUser(HttpUser):
    """Simulates user submitting proposals for guardrail evaluation."""

    wait_time = between(0.1, 0.5)  # 2-10 requests per second per user

    @task(10)
    def evaluate_proposal(self):
        """Primary task: evaluate a proposal."""
        payload = random.choice(SAMPLE_PAYLOADS)
        payload["proposal_id"] = f"load-test-{random.randint(1, 1000000)}"

        with self.client.post(
            "/evaluate",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                result = response.json()
                if "guardrail_decision" in result:
                    response.success()
                else:
                    response.failure("Missing guardrail_decision in response")
            else:
                response.failure(f"Status code: {response.status_code}")

    @task(1)
    def health_check(self):
        """Occasional health check."""
        self.client.get("/health")


class StressUser(HttpUser):
    """High-frequency user for stress testing."""

    wait_time = between(0.01, 0.05)  # 20-100 requests per second per user

    @task
    def rapid_evaluate(self):
        """Rapid-fire evaluations."""
        payload = random.choice(SAMPLE_PAYLOADS)
        self.client.post("/evaluate", json=payload)

4.2 CloudWatch Dashboard Configuration

```json
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "Response Time Percentiles",
        "metrics": [
          ["Guardrail", "Latency", "Service", "Evaluator", { "stat": "p50", "label": "p50" }],
          ["Guardrail", "Latency", "Service", "Evaluator", { "stat": "p95", "label": "p95" }],
          ["Guardrail", "Latency", "Service", "Evaluator", { "stat": "p99", "label": "p99" }]
        ],
        "period": 60,
        "annotations": {
          "horizontal": [
            { "value": 500, "label": "p95 Target", "color": "#ff0000" }
          ]
        }
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Throughput (TPS)",
        "metrics": [
          ["Guardrail", "RequestCount", "Service", "Evaluator"]
        ],
        "period": 60,
        "stat": "Sum"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Error Rate",
        "metrics": [
          ["Guardrail", "ErrorCount", "Service", "Evaluator"]
        ],
        "period": 60,
        "stat": "Sum"
      }
    }
  ]
}
```
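The dashboard body above is pushed with the CloudWatch `PutDashboard` API, which takes the widget definition as a JSON string rather than a nested object. A minimal sketch; the dashboard name is an assumption:

```python
import json

def dashboard_request(body, name="guardrail-load-test"):
    """Build PutDashboard parameters from a widget-definition dict.

    PutDashboard requires DashboardBody as a JSON *string*, hence the
    json.dumps; the dashboard name here is a placeholder.
    """
    return {"DashboardName": name, "DashboardBody": json.dumps(body)}

# To publish the dashboard saved as dashboard.json:
#   with open("dashboard.json") as f:
#       boto3.client("cloudwatch").put_dashboard(**dashboard_request(json.load(f)))
```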

4.3 Performance Report Template

```markdown
# Guardrail Service Performance Report

**Test Date**: 2026-01-15 (example; replace with actual test date)
**Environment**: [Staging/Production]
**Service Version**: [version]

## Summary

| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| p95 Latency | < 500 ms | ___ ms | ✅/❌ |
| p99 Latency | < 1000 ms | ___ ms | ✅/❌ |
| Max Throughput | > 100 RPS | ___ RPS | ✅/❌ |
| Error Rate | < 0.1% | ___% | ✅/❌ |

## Test Scenarios

### Baseline (10 RPS)
- Duration: 10 minutes
- Results: [table of metrics]

### Scale (100 RPS)
- Duration: 30 minutes
- Results: [table of metrics]

### Stress (200 RPS)
- Duration: 30 minutes
- Results: [table of metrics]

### Soak (100 RPS, 4 hours)
- Duration: 4 hours
- Results: [table of metrics]

## Bottlenecks Identified

1. [Description of bottleneck]
   - Impact: [latency impact]
   - Resolution: [fix applied]

## Recommendations

1. [Optimization recommendation]
2. [Scaling recommendation]

## Appendix: Raw Data

[Link to detailed metrics export]
```

Phase 5: Development Workflow

5.1 Test Execution Commands

```bash
# Start Locust with web UI
locust -f load_test.py --host=https://aegis-staging.acme-corp.test

# Headless baseline test
locust -f load_test.py \
  --host=https://aegis-staging.acme-corp.test \
  --headless \
  --users 10 \
  --spawn-rate 1 \
  --run-time 10m \
  --csv=baseline

# Headless scale test
locust -f load_test.py \
  --host=https://aegis-staging.acme-corp.test \
  --headless \
  --users 100 \
  --spawn-rate 10 \
  --run-time 30m \
  --csv=scale
```

5.2 Testing Strategy

| Test Type | Description | Pass Criteria |
|-----------|-------------|---------------|
| Baseline | Low load, single instance | p95 < 200 ms |
| Scale | Production load estimate | p95 < 500 ms |
| Stress | 2x production peak | Graceful degradation |
| Soak | Extended duration | No memory leaks |
| Spike | Sudden load increase | Recovery < 30 s |

5.3 CI/CD Integration

```yaml
# GitHub Actions workflow for performance regression
name: Performance Regression Test

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1'  # Weekly Monday 2 AM

jobs:
  perf-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Performance Test
        run: |
          pip install locust
          locust -f tests/perf/load_test.py \
            --host=${{ secrets.STAGING_URL }} \
            --headless \
            --users 50 \
            --spawn-rate 5 \
            --run-time 5m \
            --csv=results

      - name: Check p95 Latency
        run: |
          # The "95%" column in the Locust 2.x stats CSV; verify the field
          # index against the header row if your Locust version differs.
          P95=$(tail -1 results_stats.csv | cut -d',' -f17)
          # awk handles the comparison so fractional values don't break [ -gt ]
          if awk "BEGIN { exit !($P95 > 500) }"; then
            echo "p95 latency ${P95} ms exceeds 500 ms target"
            exit 1
          fi
```

Phase 6: Success Criteria & Metrics

6.1 Definition of Done

  • [ ] p95 latency < 500 ms validated under production load
  • [ ] Performance test report published
  • [ ] Bottlenecks identified and addressed
  • [ ] Performance regression tests in CI/CD
  • [ ] CloudWatch dashboard configured
  • [ ] Runbook for performance troubleshooting created

6.2 Success Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| p95 latency | < 500 ms | Load test results |
| Throughput capacity | > 100 RPS | Load test results |
| Error rate under load | < 0.1% | Load test results |
| Performance regression | < 10% | CI/CD comparison |

6.3 Quality Gates

| Gate | Requirement | Verification |
|------|-------------|--------------|
| Baseline complete | p95 documented | Test report |
| Scale test pass | p95 < 500 ms at 100 RPS | Test report |
| Stress test pass | Graceful degradation | Test report |
| Soak test pass | No leaks over 4 hours | Memory monitoring |
| CI/CD integration | Automated regression | Pipeline pass |

Appendix: Research Sources

  1. AWS Machine Learning Blog. "Best practices for load testing Amazon SageMaker real-time inference endpoints." https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/

  2. Testriq. "Performance Testing for AI Applications: Speed, Scalability & Reliability at Scale." https://www.testriq.com/blog/post/performance-testing-for-ai-applications

  3. RunPod. "AI Inference Optimization: Achieving Maximum Throughput with Minimal Latency." https://www.runpod.io/articles/guides/ai-inference-optimization-achieving-maximum-throughput-with-minimal-latency

  4. TrueFoundry. "LLM Load Balancing." https://www.truefoundry.com/blog/llm-load-balancing


Changelog

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2025-12-26 | Claude Code | Initial EPCC plan |