Compute & Orchestration¶
The compute layer is split into two planes: a data plane (ECS on EC2) that runs long-lived API servers and per-user worker processes, and a control plane (Lambda + SQS) that orchestrates worker lifecycle, background tasks, and maintenance. This separation means the orchestrator has no state to lose and the workers have no management overhead.
Operational guarantees
- Zero-downtime deployments via ECS rolling updates with circuit breaker (100% minimum healthy, 200% maximum); automatic rollback if new tasks fail health checks.
- No single point of failure in the control plane: Lambda functions are inherently HA with AWS-managed retries.
- The data plane recovers from any single-container failure within 60 seconds via the maintenance Lambda.
ECS Cluster¶
The cluster uses two capacity providers backed by separate Auto Scaling Groups, each optimized for its workload profile:
Why Two Capacity Providers
API tasks need CPU headroom for request processing (compute-bound). Worker tasks need memory for broker SDK sessions but barely touch the CPU during idle periods (memory-bound). Mixing them on the same instance type wastes resources in both directions.
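A minimal sketch of how the two providers could be created and attached with the ECS API (boto3). All names and ASG ARNs below are illustrative placeholders; only the two-provider, two-ASG layout comes from this document.

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder names and ARNs; only the two-provider / two-ASG layout is from this doc.
providers = {
    "api-capacity-provider": "arn:aws:autoscaling:...:autoScalingGroup:...:api-asg",
    "worker-capacity-provider": "arn:aws:autoscaling:...:autoScalingGroup:...:worker-asg",
}

for name, asg_arn in providers.items():
    ecs.create_capacity_provider(
        name=name,
        autoScalingGroupProvider={
            "autoScalingGroupArn": asg_arn,
            "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
            "managedTerminationProtection": "ENABLED",
        },
    )

# Attach both providers to the cluster; each service then pins itself to one
# provider via its own capacityProviderStrategy.
ecs.put_cluster_capacity_providers(
    cluster="trading-cluster",
    capacityProviders=list(providers),
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "api-capacity-provider", "weight": 1},
    ],
)
```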
API Capacity Provider¶
| Parameter | Value |
|---|---|
| Instance Type | m7g.large (Graviton, 2 vCPU, 8 GB) |
| Tasks per Instance | 2 |
| CPU per Task | 1024 units (1 vCPU) |
| Memory per Task | 1536 MB soft / 3072 MB hard |
| Web Server | Gunicorn with 8 Uvicorn workers |
| Network Mode | bridge (dynamic host ports) |
| Deployment | Rolling update, circuit breaker enabled |
| Min Healthy % | 100% (zero-downtime deploys) |
| Max % | 200% (double capacity during deploy) |
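For illustration, a task-definition sketch that mirrors the table above. The image URI, application module (`app.main:app`), and container port are assumptions; the CPU, memory, networking, and Gunicorn/Uvicorn settings come from the table.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="api",
    networkMode="bridge",              # dynamic host ports behind the ALB
    requiresCompatibilities=["EC2"],
    containerDefinitions=[{
        "name": "api",
        "image": "<account>.dkr.ecr.<region>.amazonaws.com/api:latest",  # placeholder
        "cpu": 1024,                   # 1 vCPU
        "memoryReservation": 1536,     # soft limit (MB)
        "memory": 3072,                # hard limit (MB)
        "portMappings": [{"containerPort": 8000, "hostPort": 0}],  # hostPort 0 = dynamic
        "command": [
            "gunicorn", "app.main:app",              # app module is an assumption
            "-k", "uvicorn.workers.UvicornWorker",
            "--workers", "8",
            "--bind", "0.0.0.0:8000",
        ],
        "essential": True,
    }],
)
```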
Auto-Scaling Policies¶
The API service uses target-tracking scaling on two dimensions:
| Policy | Target | Scale-Out Cooldown | Scale-In Cooldown |
|---|---|---|---|
| CPU Utilization | 70% average | 60s | 300s |
| Request Count | 1000 req/min per target | 60s | 300s |
Scale-in is deliberately slow (300s cooldown) to avoid thrashing during intermittent traffic spikes around market open/close.
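A sketch of equivalent target-tracking policies via Application Auto Scaling (boto3). The cluster/service identifiers, min/max capacity, and ALB resource label are placeholders; the targets and cooldowns are taken from the table.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Cluster/service names, min/max capacity, and the ALB resource label are placeholders.
resource_id = "service/trading-cluster/api-service"

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

common = {
    "ServiceNamespace": "ecs",
    "ResourceId": resource_id,
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyType": "TargetTrackingScaling",
}

aas.put_scaling_policy(
    PolicyName="api-cpu-target-tracking",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,                     # 70% average CPU
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
    **common,
)

aas.put_scaling_policy(
    PolicyName="api-request-count-target-tracking",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,                   # requests per target, per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": "app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>",
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
    **common,
)
```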
Rolling Deployment¶
Deployments use ECS rolling update with circuit breaker:
- New task definition registered
- ECS launches new tasks (up to 200% capacity)
- ALB health checks confirm new tasks are healthy
- Old tasks drain connections (300s deregistration delay)
- Old tasks stopped
If the new tasks fail health checks, the circuit breaker automatically rolls back to the previous task definition — no manual intervention required.
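The same behavior expressed as a service deployment configuration, shown here as a sketch (cluster and service names are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Cluster and service names are placeholders.
ecs.update_service(
    cluster="trading-cluster",
    service="api-service",
    deploymentConfiguration={
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        "minimumHealthyPercent": 100,   # never drop below current capacity
        "maximumPercent": 200,          # allow a full parallel set of new tasks
    },
)
```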
Worker Capacity Provider¶
| Parameter | Value |
|---|---|
| Instance Type | r7g.xlarge (Graviton, 4 vCPU, 32 GB) |
| Tasks per Instance | 60 (conservative) |
| CPU per Task | 64 units |
| Memory Soft Limit | 384 MB |
| Memory Hard Limit | 1024 MB |
| Network Mode | bridge (shared host ENI) |
| Desired Count | Managed by Lambda orchestrator |
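A worker task-definition sketch mirroring the table above. The image URI is a placeholder; the absence of port mappings reflects the outbound-only design described under Bridge Networking below.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="worker",
    networkMode="bridge",               # shared host ENI allows 60+ tasks per instance
    requiresCompatibilities=["EC2"],
    containerDefinitions=[{
        "name": "worker",
        "image": "<account>.dkr.ecr.<region>.amazonaws.com/worker:latest",  # placeholder
        "cpu": 64,                      # workers are mostly idle; 64 units each
        "memoryReservation": 384,       # soft limit used for placement (MB)
        "memory": 1024,                 # hard limit enforced by the OOM killer (MB)
        "essential": True,
        # No portMappings: workers make outbound calls only; coordination goes through Redis.
    }],
)
```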
Capacity Math¶
Each r7g.xlarge provides 4096 CPU units and 32,768 MB RAM:
| Resource | Available | Per Task | Max Tasks | Limiting? |
|---|---|---|---|---|
| CPU | 4096 units | 64 units | 4096/64 = 64 | Yes (64 < 85) |
| Memory (soft) | 32,768 MB | 384 MB soft | 32768/384 = 85 | No |
| Memory (hard) | 32,768 MB | 1024 MB hard | 32768/1024 = 32 | Worst-case only |
The target of 60 tasks per instance is deliberately conservative; as the back-of-envelope sketch after this list shows, it leaves headroom for:
- OS and ECS agent overhead (~512 MB)
- Temporary memory spikes during broker API calls
- Container runtime overhead
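A back-of-envelope check of that headroom, using the figures from the tables above (512 MB is the OS/agent estimate quoted in the list):

```python
# Headroom check for 60 workers on one r7g.xlarge.
INSTANCE_CPU_UNITS = 4096
INSTANCE_MEMORY_MB = 32_768
OS_AND_AGENT_MB = 512          # estimated OS + ECS agent overhead

TASK_CPU_UNITS = 64
TASK_MEMORY_SOFT_MB = 384
TASKS = 60

cpu_used = TASKS * TASK_CPU_UNITS              # 3,840 of 4,096 units
mem_reserved = TASKS * TASK_MEMORY_SOFT_MB     # 23,040 MB reserved

print(f"CPU headroom: {INSTANCE_CPU_UNITS - cpu_used} units")                        # 256
print(f"Memory headroom: {INSTANCE_MEMORY_MB - OS_AND_AGENT_MB - mem_reserved} MB")  # 9,216
```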
OOM Protection
If a worker exceeds its 1024 MB hard limit, the Linux OOM killer terminates only that container. The EC2 instance and all other workers continue unaffected. The maintenance Lambda detects the missing worker within 60 seconds and the orchestrator restarts it automatically.
Bridge Networking¶
Workers use bridge mode instead of awsvpc:
| Feature | awsvpc | bridge |
|---|---|---|
| ENI per task | Yes (1 each) | No (shared) |
| Max tasks (m/r xlarge) | ~3 | 60+ |
| Per-task security group | Yes | No (host SG) |
| Port mapping | Static | Dynamic |
| Cost impact | ENI limits require more instances | High density, fewer instances |
The trade-off is acceptable because workers only need outbound internet to reach broker APIs. They don't receive inbound connections — all communication flows through Redis.
Lambda Orchestrator¶
Five Lambda functions form the serverless control plane:
| Function | Trigger | Timeout | Memory | Concurrency | Purpose |
|---|---|---|---|---|---|
| worker_control | SQS FIFO (worker-control) | 60s | 256 MB | 50–500 | Start, stop, claim workers. Pool assignment and RunTask fallback. |
| order_tasks | SQS FIFO (order-tasks) | 120s | 256 MB | 50–500 | Background fill verification. Query broker for order status after execution. |
| maintenance | EventBridge (every 60s) | 300s | 256 MB | 1 | Fan-out coordinator. Scans Redis for all worker marks, partitions work, invokes maintenance_worker in parallel. |
| maintenance_worker | Lambda invoke (from maintenance) | 30s | 256 MB | 100 | Process individual orphan detection batch. Check ECS task status, clean up stale marks, stop orphan tasks. |
| pool_manager | EventBridge (every 5 min) | 60s | 256 MB | — | Count pool workers, compare to target, launch or terminate to match desired pool size. |
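As an illustration of the fan-out step, a sketch of how maintenance might hand batches to maintenance_worker using asynchronous invokes. The payload shape and batch size are assumptions; only the function names and the parallel fan-out pattern come from the table above.

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def fan_out(worker_marks: list, batch_size: int = 50) -> None:
    """Partition the worker marks scanned from Redis and hand each batch to
    maintenance_worker. Payload shape and batch size are assumptions."""
    for i in range(0, len(worker_marks), batch_size):
        lambda_client.invoke(
            FunctionName="maintenance_worker",
            InvocationType="Event",     # asynchronous: batches are processed in parallel
            Payload=json.dumps({"marks": worker_marks[i:i + batch_size]}),
        )
```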
Orchestrator Flow¶
FIFO Guarantees
The worker-control.fifo queue uses user_id as the message group ID. This ensures that multiple start/stop commands for the same user are processed in order, preventing races where a stop command is processed before the corresponding start has completed.
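A sketch of how a producer would enqueue a lifecycle command under that convention. The queue URL, payload shape, and deduplication scheme are assumptions; only the MessageGroupId = user_id rule comes from this document.

```python
import json
import uuid
import boto3

sqs = boto3.client("sqs")

def send_worker_command(queue_url: str, user_id: str, action: str) -> None:
    """Enqueue a lifecycle command; payload shape and dedup scheme are assumptions."""
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"action": action, "user_id": user_id}),
        MessageGroupId=user_id,   # all commands for one user share a group -> strict ordering
        MessageDeduplicationId=f"{user_id}:{action}:{uuid.uuid4()}",
    )
```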
SQS Queues¶
| Queue | Type | Visibility Timeout | Retention | DLQ | DLQ Max Receives | Purpose |
|---|---|---|---|---|---|---|
| worker-control.fifo | FIFO | 90s | 1 day | worker-control-dlq.fifo | 3 | Worker lifecycle commands (start, stop, claim). Message group: user_id. |
| order-tasks.fifo | FIFO | 180s | 1 day | order-tasks-dlq.fifo | 3 | Fill verification, delayed order checks. Message group: order_id. |
| pool-claim | Standard | 10s | 5 minutes | — | — | One-shot claim messages for pool workers. Short retention because unclaimed messages are stale. |
Dead Letter Queues¶
Both FIFO queues have DLQs that catch messages failing after 3 processing attempts. Both Lambda handlers use ReportBatchItemFailures so that only the specific failing record is retried — successfully processed records in the same batch are not re-delivered and do not have their receive count inflated.
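The handler shape this relies on is the SQS partial-batch-response contract. A minimal sketch, where process is a stand-in for the real worker_control / order_tasks logic:

```python
def process(record: dict) -> None:
    """Stand-in for the real worker_control / order_tasks message handling."""
    ...

def handler(event, context):
    """Lambda entry point using SQS partial batch responses."""
    failures = []
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            # Only this record is retried and has its receive count incremented;
            # every other record in the batch is treated as successfully consumed.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```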
| DLQ | Retention | CloudWatch Alarm | Dashboard |
|---|---|---|---|
| worker-control-dlq.fifo | 14 days | {env}-orchestrator-dlq-has-messages (> 0) | Yes |
| order-tasks-dlq.fifo | 14 days | {env}-order-tasks-dlq-has-messages (> 0) | Yes |
Both alarms send to the orchestrator-alerts SNS topic (email notification). The CloudWatch dashboard shows both DLQ message counts side by side.
DLQ Messages Are Genuine Failures
With ReportBatchItemFailures, only messages that truly failed 3 consecutive times reach the DLQ — no false positives from batch contamination. Common causes: Redis connectivity loss, ECS capacity exhausted, broker API persistently timing out. Action: check the corresponding Lambda error in CloudWatch Logs, fix the root cause, then redrive messages from the DLQ back to the main queue.
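Once the root cause is fixed, redrive can be done from the console or, as a minimal sketch, with the SQS message-move API (assuming DLQ redrive is available for the queue type in your account):

```python
import boto3

sqs = boto3.client("sqs")

# The ARN is a placeholder. With DestinationArn omitted, SQS moves the messages
# back to the queue they originally arrived on (the main FIFO queue).
sqs.start_message_move_task(
    SourceArn="arn:aws:sqs:<region>:<account>:worker-control-dlq.fifo",
)
```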