EPCC Implementation Plan: GAP-PerfTest¶
Issue: #2 - GAP-PerfTest: Load-Test Guardrail Service (<500 ms p95)
Status: Planning
Owner: DevOps
Created: 2025-12-26
Target Completion: Before β-to-Prod milestone (2026-03-31)
Phase 1: Research & Validation¶
1.1 Research Findings¶
| Source | Key Finding | Applied To |
|---|---|---|
| AWS SageMaker Load Testing Best Practices | Benchmark single instance first, then extrapolate; use CloudWatch metrics for TPS and latency | Testing methodology |
| Performance Testing for AI Applications | Latency validation, scalability checks, resource efficiency, throughput benchmarking | Test objectives |
| AI Inference Optimization | Batch processing, model optimization, GPU utilization monitoring | Performance tuning |
| LLM Load Balancing | Track token generation state, adapt to varying workloads, automate health checks | Architecture patterns |
1.2 Industry Standards¶
ML Inference Latency Benchmarks:
| Use Case | Typical p95 Target | Typical p99 Target |
|---|---|---|
| Real-time trading | < 10 ms | < 50 ms |
| Fraud detection | < 100 ms | < 200 ms |
| Recommendation systems | < 200 ms | < 500 ms |
| Risk scoring (our case) | < 500 ms | < 1000 ms |
Current target p95 < 500 ms is appropriate for risk scoring workloads.
1.3 Load Testing Approaches (Evaluated)¶
| Approach | Pros | Cons | Recommendation |
|---|---|---|---|
| Locust | Python-native, scriptable, distributed | Requires custom setup | Use for flexibility |
| k6 | Modern, cloud-native, good metrics | JavaScript-based | Alternative option |
| Artillery | YAML config, easy setup | Less flexible | Quick validation |
| AWS Load Testing | Native integration | AWS-specific | Use for production validation |
Selected Approach: Locust for development, AWS Distributed Load Testing for production validation
Phase 2: Architecture Overview¶
2.1 High-Level Design¶
┌─────────────────────────────────────────────────────────────────┐
│ LOAD TESTING ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Load Gen │ │ Guardrail │ │ Metrics │ │
│ │ (Locust) │───▶│ Service │───▶│ Collector │ │
│ │ │ │ │ │ (CloudWatch) │ │
│ └──────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │ │
│ │ ┌───────────────────────────┘ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ METRICS DASHBOARD │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ p50 lat │ │ p95 lat │ │ p99 lat │ │ TPS │ │ │
│ │ │ <100ms │ │ <500ms │ │ <1000ms │ │ >100/s │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
2.2 Performance Targets¶
| Metric | Target | Measurement | Alert Threshold |
|---|---|---|---|
| p50 latency | < 100 ms | CloudWatch | > 150 ms |
| p95 latency | < 500 ms | CloudWatch | > 500 ms |
| p99 latency | < 1000 ms | CloudWatch | > 1000 ms |
| Throughput | > 100 req/s | CloudWatch | < 80 req/s |
| Error rate | < 0.1% | CloudWatch | > 0.5% |
| CPU utilization | < 70% | CloudWatch | > 80% |
| Memory utilization | < 80% | CloudWatch | > 85% |
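The alert thresholds above can be wired into CloudWatch alarms. A minimal sketch for the p95 alarm, assuming the service publishes a custom `Guardrail` namespace with a `Latency` metric carrying a `Percentile` dimension (the naming used by the dashboard configuration in this plan); adjust names to the real metrics:

```python
def p95_alarm_definition(threshold_ms: float = 500.0) -> dict:
    """Kwargs for cloudwatch.put_metric_alarm(), matching the table above."""
    return {
        "AlarmName": "guardrail-p95-latency",  # illustrative name
        "Namespace": "Guardrail",
        "MetricName": "Latency",
        "Dimensions": [{"Name": "Percentile", "Value": "p95"}],
        "Statistic": "Average",
        "Period": 60,               # one-minute evaluation windows
        "EvaluationPeriods": 3,     # require 3 consecutive breaching minutes
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_p95_alarm() -> None:
    import boto3  # requires AWS credentials at call time
    boto3.client("cloudwatch").put_metric_alarm(**p95_alarm_definition())
```

Requiring three consecutive breaching minutes avoids paging on a single noisy data point during load ramps.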
2.3 ADR: Load Testing Strategy¶
ADR-002: Performance Testing Strategy
Status: Proposed
Context: Need to validate guardrail service meets p95 < 500 ms under production load.
Decision Drivers:
- Must simulate realistic production traffic patterns
- Need to identify bottlenecks before production
- Require repeatable, automated tests
Decision: Use a staged load-testing approach:
1. Baseline: Single instance, low load (10 RPS)
2. Scale: Increase to production estimate (100+ RPS)
3. Stress: Push beyond expected peak (2x production)
4. Soak: Extended duration (4+ hours) at production load
Consequences:
- Comprehensive performance profile
- Early bottleneck identification
- Requires dedicated test environment
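The staged schedule in ADR-002 can be expressed as data plus a lookup function; in Locust this logic would back a `LoadTestShape.tick()` method. Stage durations and user counts below are illustrative assumptions to be replaced with measured values:

```python
# (duration_s, users, spawn_rate) for each ADR-002 stage:
# baseline, scale, stress, soak. Values are placeholders.
STAGES = [
    (600, 10, 1),        # Baseline: ~10 RPS for 10 min
    (1800, 100, 10),     # Scale: production estimate
    (1800, 200, 20),     # Stress: 2x expected peak
    (14400, 100, 10),    # Soak: production load for 4 h
]


def stage_at(run_time_s: float):
    """Return (users, spawn_rate) for the elapsed time, or None to stop.

    Inside Locust, tick() returning None ends the test run.
    """
    elapsed = 0
    for duration, users, spawn_rate in STAGES:
        elapsed += duration
        if run_time_s < elapsed:
            return users, spawn_rate
    return None  # all stages complete
```

Keeping the schedule as plain data makes it easy to unit-test the stage boundaries before burning hours of soak time.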
Phase 3: Implementation Strategy¶
3.1 Prerequisites¶
| Prerequisite | Status | Owner | Notes |
|---|---|---|---|
| Shadow scoring service deployed | Required | DevOps | Target environment |
| Test environment isolated | Required | DevOps | Avoid production impact |
| CloudWatch metrics configured | Required | DevOps | Latency histograms |
| Sample payload corpus | Required | Data Eng | Representative proposals |
3.2 Milestones¶
Milestone 1: Test Infrastructure Setup (Week 1)¶
- [ ] 1.1 Deploy Locust cluster in test environment
- [ ] 1.2 Configure CloudWatch dashboard for latency metrics
- [ ] 1.3 Create sample payload corpus (100+ representative proposals)
- [ ] 1.4 Document baseline instance configuration
Exit Criteria: Locust can send requests to guardrail service and metrics appear in CloudWatch
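For task 1.3, a synthetic corpus can bootstrap testing before Data Eng supplies real samples. A sketch using the field schema from the Locust script later in this plan; the value ranges are illustrative assumptions, and real testing should sample from actual proposals:

```python
import json
import random


def make_payload(i: int, rng: random.Random) -> dict:
    """One synthetic proposal; field names match the load-test schema."""
    risk_base = round(rng.uniform(0.1, 0.6), 2)
    return {
        "proposal_id": f"corpus-{i:04d}",
        "risk_base": risk_base,
        "profit_base": rng.randrange(50_000, 500_000, 10_000),
        "risk_prop": round(min(risk_base + rng.uniform(-0.1, 0.2), 1.0), 2),
        "profit_prop": rng.randrange(50_000, 600_000, 10_000),
        "novelty_score": round(rng.random(), 2),
        "complexity_score": round(rng.random(), 2),
        "quality_score": round(rng.random(), 2),
    }


def write_corpus(path: str, n: int = 100, seed: int = 42) -> list:
    """Write n payloads as JSON Lines; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    corpus = [make_payload(i, rng) for i in range(n)]
    with open(path, "w") as fh:
        for payload in corpus:
            fh.write(json.dumps(payload) + "\n")
    return corpus
```

The fixed seed matters: repeatable corpora make before/after comparisons in Milestone 5 meaningful.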
Milestone 2: Baseline Performance (Week 2)¶
- [ ] 2.1 Run baseline test: 10 RPS for 10 minutes
- [ ] 2.2 Record baseline metrics (p50/p95/p99 latency, TPS, CPU, memory)
- [ ] 2.3 Identify cold start behavior
- [ ] 2.4 Document single-instance capacity
Exit Criteria: Baseline metrics documented, no errors at 10 RPS
Milestone 3: Scale Testing (Week 3)¶
- [ ] 3.1 Ramp test: 10 → 50 → 100 → 150 RPS over 30 minutes
- [ ] 3.2 Identify inflection points where latency degrades
- [ ] 3.3 Record metrics at each load level
- [ ] 3.4 Calculate required instance count for production
Exit Criteria: Understand scaling characteristics, target TPS achievable
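For task 3.4, a simple capacity calculation converts the measured single-instance ceiling into a fleet size. A sketch with placeholder numbers; the 70% headroom mirrors the CPU utilization target above:

```python
import math


def required_instances(target_rps: float,
                       per_instance_rps: float,
                       headroom: float = 0.7) -> int:
    """Instances needed so each runs at or below `headroom` of its
    measured capacity at the target aggregate throughput."""
    usable = per_instance_rps * headroom
    return math.ceil(target_rps / usable)


# Hypothetical example: a single instance tops out at 40 RPS before p95
# degrades, and the production target is 100 RPS.
# required_instances(100, 40) -> 4
```

Sizing against the latency inflection point found in task 3.2, rather than raw maximum throughput, keeps p95 inside the target at peak load.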
Milestone 4: Stress & Soak Testing (Week 4)¶
- [ ] 4.1 Stress test: 2x expected peak load for 30 minutes
- [ ] 4.2 Soak test: Production load for 4+ hours
- [ ] 4.3 Monitor for memory leaks, connection exhaustion
- [ ] 4.4 Document failure modes and recovery behavior
Exit Criteria: No degradation over extended periods, graceful failure under stress
Milestone 5: Optimization & Report (Week 5)¶
- [ ] 5.1 Address identified bottlenecks:
- [ ] Add caching if model loading is slow
- [ ] Optimize computation if CPU-bound
- [ ] Add connection pooling if I/O-bound
- [ ] 5.2 Re-run tests to validate improvements
- [ ] 5.3 Create performance test report
- [ ] 5.4 Establish performance regression tests for CI/CD
Exit Criteria: p95 < 500 ms validated, performance report published
3.3 Risk Register¶
| Risk | Likelihood | Impact | Score | Mitigation |
|---|---|---|---|---|
| Test environment differs from prod | M | H | 6 | Use identical instance types |
| Payload corpus not representative | M | M | 4 | Sample from real proposals |
| Network latency skews results | L | M | 2 | Run Locust in same VPC |
| Cold start affects measurements | M | L | 2 | Warm up before measuring |
Phase 4: Technical Excellence¶
4.1 Locust Test Script (Python)¶
"""
Guardrail Service Load Test
Issue #2: GAP-PerfTest
"""
from locust import HttpUser, task, between
import json
import random
# Sample payloads representing realistic proposals
SAMPLE_PAYLOADS = [
{
"proposal_id": "test-001",
"risk_base": 0.3,
"profit_base": 100000,
"risk_prop": 0.35,
"profit_prop": 120000,
"novelty_score": 0.6,
"complexity_score": 0.7,
"quality_score": 0.8
},
# Add more representative payloads...
]
class GuardrailUser(HttpUser):
"""Simulates user submitting proposals for guardrail evaluation."""
wait_time = between(0.1, 0.5) # 2-10 requests per second per user
@task(10)
def evaluate_proposal(self):
"""Primary task: evaluate a proposal."""
payload = random.choice(SAMPLE_PAYLOADS)
payload["proposal_id"] = f"load-test-{random.randint(1, 1000000)}"
with self.client.post(
"/evaluate",
json=payload,
catch_response=True
) as response:
if response.status_code == 200:
result = response.json()
if "guardrail_decision" in result:
response.success()
else:
response.failure("Missing guardrail_decision in response")
else:
response.failure(f"Status code: {response.status_code}")
@task(1)
def health_check(self):
"""Occasional health check."""
self.client.get("/health")
class StressUser(HttpUser):
"""High-frequency user for stress testing."""
wait_time = between(0.01, 0.05) # 20-100 requests per second per user
@task
def rapid_evaluate(self):
"""Rapid-fire evaluations."""
payload = random.choice(SAMPLE_PAYLOADS)
self.client.post("/evaluate", json=payload)
4.2 CloudWatch Dashboard Configuration¶
{
"widgets": [
{
"type": "metric",
"properties": {
"title": "Response Time Percentiles",
"metrics": [
["Guardrail", "Latency", "Percentile", "p50"],
["Guardrail", "Latency", "Percentile", "p95"],
["Guardrail", "Latency", "Percentile", "p99"]
],
"period": 60,
"stat": "Average",
"annotations": {
"horizontal": [
{ "value": 500, "label": "p95 Target", "color": "#ff0000" }
]
}
}
},
{
"type": "metric",
"properties": {
"title": "Throughput (TPS)",
"metrics": [
["Guardrail", "RequestCount", "Service", "Evaluator"]
],
"period": 60,
"stat": "Sum"
}
},
{
"type": "metric",
"properties": {
"title": "Error Rate",
"metrics": [
["Guardrail", "ErrorCount", "Service", "Evaluator"]
],
"period": 60,
"stat": "Sum"
}
}
]
}
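The dashboard JSON above can be published programmatically so it lives in version control rather than being hand-edited in the console. A sketch assuming the JSON is saved as `dashboard.json`; the dashboard name is an illustrative choice:

```python
import json


def load_dashboard_body(path: str = "dashboard.json") -> str:
    """CloudWatch expects the dashboard body as a JSON string.

    Round-tripping through json validates the file before upload.
    """
    with open(path) as fh:
        return json.dumps(json.load(fh))


def publish_dashboard(body: str) -> None:
    import boto3  # requires AWS credentials at call time
    boto3.client("cloudwatch").put_dashboard(
        DashboardName="guardrail-load-test",
        DashboardBody=body,
    )
```

Publishing from the repo also makes dashboard changes reviewable alongside the test scripts they monitor.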
4.3 Performance Report Template¶
# Guardrail Service Performance Report
**Test Date**: 2026-01-15 (example — replace with actual test date)
**Environment**: [Staging/Production]
**Service Version**: [version]
## Summary
| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| p95 Latency | < 500 ms | ___ ms | ✅/❌ |
| p99 Latency | < 1000 ms | ___ ms | ✅/❌ |
| Max Throughput | > 100 RPS | ___ RPS | ✅/❌ |
| Error Rate | < 0.1% | ___% | ✅/❌ |
## Test Scenarios
### Baseline (10 RPS)
- Duration: 10 minutes
- Results: [table of metrics]
### Scale (100 RPS)
- Duration: 30 minutes
- Results: [table of metrics]
### Stress (200 RPS)
- Duration: 30 minutes
- Results: [table of metrics]
### Soak (100 RPS, 4 hours)
- Duration: 4 hours
- Results: [table of metrics]
## Bottlenecks Identified
1. [Description of bottleneck]
- Impact: [latency impact]
- Resolution: [fix applied]
## Recommendations
1. [Optimization recommendation]
2. [Scaling recommendation]
## Appendix: Raw Data
[Link to detailed metrics export]
Phase 5: Development Workflow¶
5.1 Test Execution Commands¶
# Start Locust with web UI
locust -f load_test.py --host=https://aegis-staging.acme-corp.test
# Headless baseline test
locust -f load_test.py \
--host=https://aegis-staging.acme-corp.test \
--headless \
--users 10 \
--spawn-rate 1 \
--run-time 10m \
--csv=baseline
# Headless scale test
locust -f load_test.py \
--host=https://aegis-staging.acme-corp.test \
--headless \
--users 100 \
--spawn-rate 10 \
--run-time 30m \
--csv=scale
5.2 Testing Strategy¶
| Test Type | Description | Pass Criteria |
|---|---|---|
| Baseline | Low load, single instance | p95 < 200 ms |
| Scale | Production load estimate | p95 < 500 ms |
| Stress | 2x production peak | Graceful degradation |
| Soak | Extended duration | No memory leaks |
| Spike | Sudden load increase | Recovery < 30s |
5.3 CI/CD Integration¶
# GitHub Actions workflow for performance regression
name: Performance Regression Test
on:
push:
branches: [main]
schedule:
- cron: '0 2 * * 1' # Weekly Monday 2 AM
jobs:
perf-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Performance Test
run: |
pip install locust
locust -f tests/perf/load_test.py \
--host=${{ secrets.STAGING_URL }} \
--headless \
--users 50 \
--spawn-rate 5 \
--run-time 5m \
--csv=results
      - name: Check p95 Latency
        run: |
          # The 95% column's position varies across Locust versions,
          # so look it up by header instead of hard-coding a field index.
          P95=$(awk -F',' 'NR==1{for(i=1;i<=NF;i++) if($i=="95%") c=i} END{print int($c)}' results_stats.csv)
          if [ "$P95" -gt 500 ]; then
            echo "p95 latency $P95 ms exceeds 500 ms target"
            exit 1
          fi
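Because the column layout of Locust's `*_stats.csv` has shifted between releases, a small Python gate that resolves the `95%` column by header and targets the `Aggregated` row is more robust than positional shell parsing. A sketch, assuming Locust's default CSV output:

```python
import csv


def p95_from_stats(path: str, row_name: str = "Aggregated") -> float:
    """Read the 95th-percentile latency (ms) for one row of a Locust
    stats CSV, located by column header rather than position."""
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            if row.get("Name") == row_name:
                return float(row["95%"])
    raise ValueError(f"no '{row_name}' row in {path}")


# CI gate: fail the build when the aggregate p95 exceeds the target.
# assert p95_from_stats("results_stats.csv") <= 500
```

The same helper can feed the weekly trend report, since it returns the raw value instead of just pass/fail.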
Phase 6: Success Criteria & Metrics¶
6.1 Definition of Done¶
- [ ] p95 latency < 500 ms validated under production load
- [ ] Performance test report published
- [ ] Bottlenecks identified and addressed
- [ ] Performance regression tests in CI/CD
- [ ] CloudWatch dashboard configured
- [ ] Runbook for performance troubleshooting created
6.2 Success Metrics¶
| Metric | Target | Measurement |
|---|---|---|
| p95 latency | < 500 ms | Load test results |
| Throughput capacity | > 100 RPS | Load test results |
| Error rate under load | < 0.1% | Load test results |
| Performance regression | < 10% | CI/CD comparison |
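The <10% regression budget above implies comparing each run's p95 against a stored baseline. A minimal sketch; the budget value and sign convention are taken from the table, while file handling is left to the pipeline:

```python
def regression_pct(baseline_ms: float, current_ms: float) -> float:
    """Relative change in latency; positive means the current run is slower."""
    return (current_ms - baseline_ms) / baseline_ms * 100.0


def within_budget(baseline_ms: float, current_ms: float,
                  budget_pct: float = 10.0) -> bool:
    """True when the run stays inside the regression budget."""
    return regression_pct(baseline_ms, current_ms) < budget_pct


# Hypothetical example: baseline p95 of 400 ms, current 430 ms is a
# 7.5% regression and passes; 450 ms would be 12.5% and fail.
```

The baseline should be refreshed deliberately (e.g. after a validated optimization in Milestone 5), not on every green run, or slow drift will never trip the gate.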
6.3 Quality Gates¶
| Gate | Requirement | Verification |
|---|---|---|
| Baseline complete | p95 documented | Test report |
| Scale test pass | p95 < 500 ms at 100 RPS | Test report |
| Stress test pass | Graceful degradation | Test report |
| Soak test pass | No leaks over 4 hours | Memory monitoring |
| CI/CD integration | Automated regression | Pipeline pass |
Appendix: Research Sources¶
- AWS Machine Learning Blog. "Best practices for load testing Amazon SageMaker real-time inference endpoints." https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/
- Testriq. "Performance Testing for AI Applications: Speed, Scalability & Reliability at Scale." https://www.testriq.com/blog/post/performance-testing-for-ai-applications
- RunPod. "AI Inference Optimization: Achieving Maximum Throughput with Minimal Latency." https://www.runpod.io/articles/guides/ai-inference-optimization-achieving-maximum-throughput-with-minimal-latency
- TrueFoundry. "LLM Load Balancing." https://www.truefoundry.com/blog/llm-load-balancing
Changelog¶
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2025-12-26 | Claude Code | Initial EPCC plan |