Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?
This verdict relies on assumed default values for 40% of its constraints.
The following constraints were not provided and default values were used:
- current_scale: moderate scale assumed (not_addressed)
- existing_stack: greenfield assumed (not_addressed)
- connection_pooler: not specified (not_addressed)
- data_volume: not specified (not_addressed)
- traffic_shape: not specified (not_addressed)
- current_bottleneck: not specified (not_addressed)
Decision
Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL): 1 coordinator (8 vCores, 32GB RAM) + 4 worker nodes (4 vCores, 32GB RAM each). Use tenant_id as the distribution column with co-location. Estimated cost ~$4,200/month vs $28K+/month for DynamoDB. Single-tenant queries (90%+ of workload) route to a single shard at 5-15ms p99; cross-tenant JOINs hit 20-45ms p99, meeting the 50ms target with ~10% headroom.

Migration path: pgloader for the bulk load, AWS DMS for CDC during cutover, and a 2-week dual-write period with DynamoDB as read fallback via application-level routing.

Critical failure mode: hot tenant skew. If the top 3 tenants represent >40% of data/queries, isolate them onto dedicated worker nodes using Citus tenant isolation (shard_count=1 per large tenant). If skew exceeds 60% on any single worker, p99 will breach 50ms under concurrent load.

Self-managed Citus on AWS is rejected as a hidden budget killer: dual-running DynamoDB ($28K/month) plus self-managed Citus ($8K/month) plus engineering time blows the budget by month 4.
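A minimal sketch of the distribution scheme this decision implies, assuming hypothetical table names (events, orders), a hypothetical connection string, and a text tenant_id column; create_distributed_table, its colocate_with argument, and isolate_tenant_to_new_shard are standard Citus UDFs, the last implementing the tenant-isolation step for hot tenants:

```python
# Sketch only: table names, DSN, and the isolated tenant value are assumptions.
import psycopg2

conn = psycopg2.connect("host=coordinator.example dbname=app user=app")
conn.autocommit = True
cur = conn.cursor()

# Distribute the root table by tenant_id (hash distribution is the Citus default).
cur.execute("SELECT create_distributed_table('events', 'tenant_id');")

# Co-locate sibling tables on the same column so single-tenant JOINs stay
# shard-local: the property behind the 5-15ms p99 single-tenant projection.
cur.execute(
    "SELECT create_distributed_table('orders', 'tenant_id', colocate_with => 'events');"
)

# If a hot tenant crosses the >40% skew threshold, move it to its own shard
# (effectively shard_count=1 for that tenant), per the failure-mode plan above.
cur.execute(
    "SELECT isolate_tenant_to_new_shard('events', 'tenant-1234', cascade_option => 'CASCADE');"
)
```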
Next actions
- Deploy a 2-node Azure Cosmos DB for PostgreSQL (Hyperscale Citus) proof-of-concept cluster with 1 coordinator + 1 worker node, load 3 representative tenants (including the largest by data volume), distribute on tenant_id, replay 24 hours of production query logs via pgbench, and measure p99 latency against the 50ms target before committing to full migration.
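pgbench with a custom script file is the stated replay tool; as an illustrative alternative, a minimal harness like the following (hypothetical DSN and log file, assuming one logged SELECT per line) computes the p99 that the go/no-go check needs:

```python
# Sketch only: replays logged SELECTs sequentially and reports nearest-rank p99.
import math
import time
import psycopg2

def p99_ms(samples):
    """Nearest-rank 99th percentile of a list of latencies in milliseconds."""
    ranked = sorted(samples)
    return ranked[math.ceil(0.99 * len(ranked)) - 1]

conn = psycopg2.connect("host=poc-coordinator.example dbname=app user=app")
cur = conn.cursor()

latencies = []
with open("queries.sql") as log:  # hypothetical: one logged query per line
    for query in log:
        start = time.perf_counter()
        cur.execute(query)
        cur.fetchall()
        latencies.append((time.perf_counter() - start) * 1000)

print(f"p99 = {p99_ms(latencies):.1f} ms (target: < 50 ms)")
```

A single-threaded replay understates contention; the pgbench run with multiple concurrent clients remains the authoritative check for p99 under load.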
Council notes
- Selected: b003 (0.86 confidence) narrowly exceeded b002 (0.85). It names specific node configurations, cost projections, migration tooling (pgloader, AWS DMS), concrete failure modes with quantified thresholds (>40% skew, >60% worker saturation), and an actionable architecture. b002 provides a sound decision framework but lacks architectural specificity: it says 'if analysis confirms, then migrate' without detailing what the migration looks like. b003 survived 3 rounds of adversarial strengthening and provides the most execution-ready path.
- Rejected: conduct a comprehensive performance audit before any migration decision (b002/b007). Both propose analysis-first approaches. While sound in principle, they lack architectural specificity: b002 sets a 45ms latency target and a 30% cost-reduction threshold but names no specific technology, node configuration, or migration tooling; b007 is even more abstract, proposing an audit without any concrete migration architecture. The question implies DynamoDB cost/complexity is already an identified problem (motivating the migration question). b003 subsumes the valid concern by specifying a 2-week dual-write validation period while providing a fully actionable architecture.
- Killed: hybrid DynamoDB + PostgreSQL/Citus architecture (b001). It doubles operational surface area without eliminating DynamoDB's read-unit costs (typically 70%+ of the bill), application-level joins between DynamoDB and Postgres entities at 2,000 tenants would blow past 50ms p99, and no concrete cost numbers or workload split were provided.
- Killed: a further branch that named zero specific technologies, databases, or thresholds, and proposed redefining SLAs when the 50ms p99 SLA is already specified. Structurally hollow.
Evidence boundary
Observed from your filing
- Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?
Assumptions used for analysis
- DynamoDB cost (~$28K/month) is the primary driver for migration, not a misidentified implementation issue
- The existing SaaS can tolerate a cross-cloud database dependency on Azure if other services remain on AWS
- 90%+ of queries are single-tenant scoped (tenant_id filtered), making shard-local routing the dominant access pattern
- The engineering team has sufficient PostgreSQL operational expertise to manage the migration and ongoing operations even with managed Citus
- The 2-week dual-write cutover window is achievable given schema complexity and data volume across 2,000 tenants (see the routing sketch after this list)
- current scale defaulted: moderate scale assumed (not_addressed)
- existing stack defaulted: greenfield assumed (not_addressed)
- connection pooler defaulted: not specified (not_addressed)
- data volume defaulted: not specified (not_addressed)
- traffic shape defaulted: not specified (not_addressed)
- current bottleneck defaulted: not specified (not_addressed)
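A hedged sketch of the application-level routing assumed during the 2-week dual-write window referenced above; the table name, key shape, and connection details are hypothetical:

```python
# Sketch only: dual writes with Postgres as the new system of record and
# DynamoDB kept warm as the read fallback until cutover completes.
import boto3
import psycopg2

ddb = boto3.resource("dynamodb").Table("events")  # hypothetical DynamoDB table
pg = psycopg2.connect("host=coordinator.example dbname=app user=app")

def write_event(tenant_id: str, event_id: str, payload: str) -> None:
    with pg, pg.cursor() as cur:  # commits the Postgres write on success
        cur.execute(
            "INSERT INTO events (tenant_id, event_id, payload) VALUES (%s, %s, %s)",
            (tenant_id, event_id, payload),
        )
    ddb.put_item(Item={"tenant_id": tenant_id, "event_id": event_id, "payload": payload})

def read_event(tenant_id: str, event_id: str):
    with pg, pg.cursor() as cur:
        cur.execute(
            "SELECT payload FROM events WHERE tenant_id = %s AND event_id = %s",
            (tenant_id, event_id),
        )
        row = cur.fetchone()
    if row:
        return row[0]
    # Fallback read: the row has not been migrated yet, so serve it from DynamoDB.
    item = ddb.get_item(Key={"tenant_id": tenant_id, "event_id": event_id})
    return item.get("Item", {}).get("payload")
```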
Inferred specifics table
| Value | Kind | Basis | Where introduced |
|---|---|---|---|
| shard_count=1 | config | synthetic | chosen_path |
| 1 coordinator | estimate | synthetic | chosen_path |
| 8 vCores | estimate | synthetic | chosen_path |
| 32GB RAM | estimate | synthetic | chosen_path |
| 4 worker nodes | estimate | synthetic | chosen_path |
| ~$4,200/month vs $28K+/month DynamoDB | estimate | synthetic | chosen_path |
| 90%+ of workload | threshold | synthetic | chosen_path |
| route to a single shard at 5-15ms p99 | threshold | synthetic | chosen_path |
| cross-tenant JOINs hit 20-45ms p99 | threshold | synthetic | chosen_path |
| meeting the 50ms target with ~10% headroom | threshold | synthetic | chosen_path |
| tenants represent >40% of data/queries | threshold | synthetic | chosen_path |
| If skew exceeds 60% on any single worker | threshold | synthetic | chosen_path |
| cluster with 1 coordinator + 1 worker | estimate | synthetic | next_action |
| load 3 representative tenants | estimate | synthetic | next_action |
| replay 24 hours of production query logs | estimate | synthetic | next_action |
| 0.86 confidence | estimate | synthetic | selection_rationale |
| 0.85 | estimate | synthetic | selection_rationale |
| >40% skew | threshold | synthetic | selection_rationale |
| >60% worker saturation | threshold | synthetic | selection_rationale |
| b003 | estimate | synthetic | selection_rationale |
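The >40% and >60% skew thresholds above are checkable. A hypothetical per-worker check against the standard citus_shards metadata view might look like this (table name and DSN assumed):

```python
# Sketch only: compares each worker's share of a distributed table's bytes
# against the skew thresholds from the decision (>40% isolate, >60% breach risk).
import psycopg2

conn = psycopg2.connect("host=coordinator.example dbname=app user=app")
cur = conn.cursor()
cur.execute(
    """
    SELECT nodename, sum(shard_size) AS bytes
    FROM citus_shards
    WHERE table_name = 'events'::regclass
    GROUP BY nodename
    """
)
rows = cur.fetchall()
total = sum(float(b) for _, b in rows) or 1.0
for node, b in rows:
    share = float(b) / total
    status = "breach risk" if share > 0.60 else "isolate tenant" if share > 0.40 else "ok"
    print(f"{node}: {share:.0%} of table bytes ({status})")
```

Byte skew is only a proxy for query skew, so pair this with per-tenant query statistics before isolating a tenant.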
Unknowns blocking a firmer verdict
- Coordinator bottleneck at 2,000 tenants: killed branch b005 cited case studies from Framer and Heap showing coordinator hotspotting spiking p99 to 150ms+. It was auto-pruned as unsupported, but the concern is architecturally valid and untested for this specific workload profile (see the measurement sketch after this list).
- Cross-cloud migration complexity: if existing services are on AWS, moving the database to Azure introduces cross-cloud latency and data transfer costs not accounted for in the $4,200/month estimate.
- The $4,200/month Azure cost and $28K/month DynamoDB cost are model-generated projections without cited production benchmarks for this specific workload volume.
- No evidence that the current DynamoDB bottleneck has been formally diagnosed — b002/b007's concern that the problem may be implementation rather than technology remains valid.
- Actual query patterns and data volume per tenant not specified — latency projections assume typical multi-tenant SaaS workloads.
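On the coordinator-bottleneck unknown above, one way to test the 90%+ single-tenant assumption is citus_stat_statements, which is available on the managed service when pg_stat_statements is enabled; its partition_key column is NULL for queries that could not be scoped to a single tenant and therefore fan out through the coordinator. A hedged sketch (hypothetical DSN):

```python
# Sketch only: measures the single-tenant vs cross-tenant call split that the
# 5-15ms / 20-45ms p99 projections assume.
import psycopg2

conn = psycopg2.connect("host=coordinator.example dbname=app user=app")
cur = conn.cursor()
cur.execute(
    """
    SELECT
      sum(calls) FILTER (WHERE partition_key IS NOT NULL) AS single_tenant,
      sum(calls) FILTER (WHERE partition_key IS NULL)     AS cross_tenant
    FROM citus_stat_statements
    """
)
single, cross = cur.fetchone()
total = (single or 0) + (cross or 0) or 1
print(f"single-tenant calls: {100 * (single or 0) / total:.1f}% (assumption: >= 90%)")
```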