Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?

accepted_conditional · Infrastructure scaling · software_operational · Pro · 646s · $0.86

This verdict relies on defaulted values for 40% of its constraints. The constraints that were not provided, and the defaults substituted for them, are listed under "Assumptions used for analysis" in the Evidence boundary section.

7 branches explored · 3 survived · 3 rounds · integrity 75%
72% confidence
Risk: unknown · 646s

Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL) with 1 coordinator (8 vCores)


Decision

Concrete components, topology, and thresholds named below are candidate mitigations or example implementations inferred by the Council. They were not confirmed in your filing or established as part of your current environment.

Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL): 1 coordinator (8 vCores, 32GB RAM) + 4 worker nodes (4 vCores, 32GB RAM each). Use tenant_id as the distribution column with co-location. Estimated cost is ~$4,200/month versus $28K+/month for DynamoDB. Single-tenant queries (90%+ of the workload) route to a single shard at 5-15ms p99; cross-tenant JOINs land at 20-45ms p99, meeting the 50ms target with ~10% headroom. Use pgloader for the bulk migration and AWS DMS for CDC during cutover, with a 2-week dual-write period (DynamoDB as the read fallback via application-level routing). The critical failure mode is hot-tenant skew: if the top 3 tenants represent >40% of data/queries, isolate them onto dedicated worker nodes using Citus tenant isolation (shard_count=1 per large tenant); if skew exceeds 60% on any single worker, p99 will breach 50ms under concurrent load. Self-managed Citus on AWS is rejected as a hidden budget killer: dual-running DynamoDB ($28K/month) plus self-managed Citus ($8K/month) plus engineering overhead blows the budget by month 4.
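The 2-week dual-write period above implies application-level read routing with DynamoDB as the fallback. A minimal sketch of that routing, using in-memory dictionaries as stand-ins for the real Citus (psycopg) and DynamoDB (boto3) clients; the key format and store contents are illustrative, not from the filing:

```python
# Hedged sketch of the dual-write cutover's read path: prefer Citus, fall
# back to DynamoDB on a miss or error. Client callables are stand-ins.

def read_with_fallback(key, read_citus, read_dynamo):
    """Return (value, source), preferring Citus, falling back to DynamoDB."""
    try:
        value = read_citus(key)
        if value is not None:
            return value, "citus"
    except Exception:
        pass  # treat errors as a miss during the dual-write window
    return read_dynamo(key), "dynamodb"

# Illustrative stores: one record not yet backfilled into Citus
citus_store = {"tenant-1:order-9": {"total": 42}}
dynamo_store = {"tenant-1:order-9": {"total": 42}, "tenant-2:order-1": {"total": 7}}

print(read_with_fallback("tenant-2:order-1", citus_store.get, dynamo_store.get))
# ({'total': 7}, 'dynamodb')
```

In production the same wrapper would also emit a metric on every fallback hit, so the cutover team can watch the fallback rate trend toward zero before retiring DynamoDB reads.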

Next actions

Candidate estimate (inferred, not source-confirmed): Deploy proof-of-concept Azure Hyperscale Citus cluster (1 coordinator + 1 worker), load top 3 tenants by data volume, replay 24h production query logs, measure p99 latency
backend · immediate
Candidate estimate (inferred, not source-confirmed): Measure current DynamoDB hot partition distribution: identify top 3 tenants by query volume and data size, calculate skew percentage to determine if tenant isolation will be needed on Citus
data · immediate
Set up pgloader bulk migration pipeline and AWS DMS CDC replication from DynamoDB to Citus staging environment
infra · before_launch
If existing infrastructure is AWS-only, evaluate whether cross-cloud latency to Azure is acceptable or whether self-managed Citus on AWS with extended budget timeline is preferable
infra · immediate
Candidate estimate (inferred, not source-confirmed): Set up p99 latency alerting at 45ms threshold (5ms buffer) on the Citus coordinator and per-worker node query latency dashboards
infra · before_launch
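The skew measurement in the actions above feeds directly into the verdict's isolation thresholds (>40% top-3 share, >60% on a single worker). A sketch of that decision rule; the function name and the share numbers are illustrative, not observed values:

```python
# Apply the Council's candidate skew thresholds to observed query/data shares.
# Shares are fractions in [0, 1]; all inputs below are made-up examples.

def isolation_plan(tenant_query_share: dict, worker_share: dict) -> dict:
    """Decide tenant isolation from per-tenant and per-worker load shares."""
    top3 = sum(sorted(tenant_query_share.values(), reverse=True)[:3])
    hottest_worker = max(worker_share.values()) if worker_share else 0.0
    return {
        "isolate_top_tenants": top3 > 0.40,   # dedicate workers, shard_count=1
        "p99_breach_risk": hottest_worker > 0.60,
        "top3_share": round(top3, 3),
    }

plan = isolation_plan(
    {"t1": 0.22, "t2": 0.15, "t3": 0.08, "t4": 0.05, "t5": 0.04},
    {"w1": 0.35, "w2": 0.25, "w3": 0.22, "w4": 0.18},
)
print(plan)  # {'isolate_top_tenants': True, 'p99_breach_risk': False, 'top3_share': 0.45}
```

The same thresholds can be evaluated against the DynamoDB-side measurements before migration, so the isolation decision is made before shard layout is fixed.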
This verdict stops being true when
Candidate estimate (inferred, not source-confirmed): DynamoDB costs are primarily driven by implementation issues (poor partition key design, over-provisioned capacity) and a 30%+ cost reduction is achievable through optimization alone → Optimize existing DynamoDB setup: redesign partition keys, implement auto-scaling, add DAX caching layer, defer migration
Candidate estimate (inferred, not source-confirmed): Proof-of-concept shows coordinator bottleneck at 2,000 tenants causes p99 > 50ms under production-equivalent concurrent load → Evaluate self-managed Citus on AWS with multiple coordinators, or consider CockroachDB/TiDB as distributed SQL alternatives without single-coordinator constraint
Candidate estimate (inferred, not source-confirmed): Existing infrastructure is entirely AWS-native and cross-cloud latency to Azure adds >10ms to p99, eating the safety margin → Deploy self-managed Citus on AWS EC2/EKS with increased budget allocation for DBA operational overhead
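Several of these flip conditions hinge on a measured p99. A minimal nearest-rank percentile check, usable against replayed query-log latencies from the proof of concept; the sample latencies are synthetic:

```python
# Nearest-rank p99 over latency samples, compared against the 45ms alert
# threshold named in the next actions. Sample values below are synthetic.
import math

def p99(samples_ms: list) -> float:
    """Nearest-rank 99th percentile: the value at rank ceil(0.99 * n)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

# 985 fast single-tenant reads + 15 slower cross-tenant JOINs
samples = [12.0] * 985 + [44.0] * 15
latency = p99(samples)
print(latency, latency > 45.0)  # 44.0 False
```

Nearest-rank is deliberately conservative (no interpolation), which suits an alerting threshold: a breach reported by this function corresponds to at least 1% of real queries exceeding the limit.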

Council notes

Vulcan
Propose a hybrid architecture: retain DynamoDB for read-heavy, non-relational workloads while introducing PostgreSQL ...
Socrates
Before considering migration, conduct a comprehensive database implementation audit of the current DynamoDB setup. Ma...
Daedalus
RECOMMENDATION: Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Citus), NOT self-managed Citus on EC2/R...
Loki
Azure Cosmos DB for PostgreSQL (Hyperscale Citus) recommendation ignores real-world coordinator bottlenecks: with 2,0...

Evidence boundary

Observed from your filing

  • Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?

Assumptions used for analysis

  • DynamoDB cost (~$28K/month) is the primary driver for migration, not a misidentified implementation issue
  • The existing SaaS can tolerate a cross-cloud database dependency on Azure if other services remain on AWS
  • 90%+ of queries are single-tenant scoped (tenant_id filtered), making shard-local routing the dominant access pattern
  • The engineering team has sufficient PostgreSQL operational expertise to manage the migration and ongoing operations even with managed Citus
  • The 2-week dual-write cutover window is achievable given schema complexity and data volume across 2,000 tenants
  • current scale defaulted: moderate scale assumed (not_addressed)
  • existing stack defaulted: greenfield assumed (not_addressed)
  • connection pooler defaulted: not specified (not_addressed)
  • data volume defaulted: not specified (not_addressed)
  • traffic shape defaulted: not specified (not_addressed)
  • current bottleneck defaulted: not specified (not_addressed)

Inferred candidate specifics

These details were introduced by the Council during analysis. They were not supplied in your filing.

  • Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL): 1 coordinator (8 vCores, 32GB RAM) + 4 worker nodes (4 vCores, 32GB RAM each). Use tenant_id as distribution column with co-location. Estimated cost ~$4,200/month vs $28K+/month DynamoDB. Single-tenant queries (90%+ of workload) route to a single shard at 5-15ms p99; cross-tenant JOINs hit 20-45ms p99, meeting the 50ms target with ~10% headroom. Use pgloader for bulk migration, AWS DMS for CDC during cutover, with a 2-week dual-write period (DynamoDB as read fallback via application-level routing). Critical failure mode: hot tenant skew. If the top 3 tenants represent >40% of data/queries, isolate them onto dedicated worker nodes using Citus tenant isolation (shard_count=1 per large tenant). If skew exceeds 60% on any single worker, p99 will breach 50ms under concurrent load. Self-managed Citus on AWS is rejected as a hidden budget killer — dual-running DynamoDB ($28K/month) + self-managed Citus ($8K/month) + engineering blows the budget by month 4.
  • Deploy a 2-node Azure Cosmos DB for PostgreSQL (Hyperscale Citus) proof-of-concept cluster with 1 coordinator + 1 worker node, load 3 representative tenants (including the largest by data volume), distribute on tenant_id, replay 24 hours of production query logs via pgbench, and measure p99 latency against the 50ms target before committing to full migration.
  • b003 (0.86 confidence) narrowly exceeded b002 (0.85) and was selected because it names specific node configurations, cost projections, migration tooling (pgloader, AWS DMS), concrete failure modes with quantified thresholds (>40% skew, >60% worker saturation), and an actionable architecture. b002 provides a sound decision framework but lacks architectural specificity — it says 'if analysis confirms, then migrate' without detailing what the migration looks like. b003 survived 3 rounds of adversarial strengthening and provides the most execution-ready path.
  • Conduct comprehensive performance audit before any migration decision (b002/b007)
  • Both b002 and b007 propose analysis-first approaches. While sound in principle, they lack architectural specificity — b002 sets a 45ms latency target and 30% cost reduction threshold but names no specific technology, node configuration, or migration tooling. b007 is even more abstract, proposing an audit without any concrete migration architecture. The question implies DynamoDB cost/complexity is already an identified problem (motivating the migration question). b003 subsumes the valid concern by specifying a 2-week dual-write validation period while providing a fully actionable architecture.
  • Hybrid DynamoDB + PostgreSQL/Citus architecture (b001, killed)
  • Doubles operational surface area without eliminating DynamoDB's read unit costs (typically 70%+ of the bill). Application-level joins between DynamoDB and Postgres entities at 2,000 tenants blow past 50ms p99. No concrete cost numbers or workload split provided.
  • Named zero specific technologies, databases, or thresholds. Proposed redefining SLAs when the 50ms p99 SLA is already specified. Structurally hollow.

Inferred specifics table

Structured audit rows for Council-added details. Synthetic basis means the detail was introduced by analysis, not supplied by the filing.

| Value | Kind | Basis | Where introduced |
| --- | --- | --- | --- |
| shard_count=1 | config | synthetic | chosen_path |
| 1 coordinator | estimate | synthetic | chosen_path |
| 8 vCores | estimate | synthetic | chosen_path |
| 32GB RAM | estimate | synthetic | chosen_path |
| + 4 worker nodes | estimate | synthetic | chosen_path |
| ~$4,200/month vs $28K+/month DynamoDB | estimate | synthetic | chosen_path |
| 90%+ of workload | threshold | synthetic | chosen_path |
| route to a single shard at 5-15ms p99 | threshold | synthetic | chosen_path |
| cross-tenant JOINs hit 20-45ms p99 | threshold | synthetic | chosen_path |
| meeting the 50ms target with ~10% headroom | threshold | synthetic | chosen_path |
| tenants represent >40% of data/queries | threshold | synthetic | chosen_path |
| if skew exceeds 60% on any single worker | threshold | synthetic | chosen_path |
| cluster with 1 coordinator + 1 worker | estimate | synthetic | next_action |
| load 3 representative tenants | estimate | synthetic | next_action |
| replay 24 hours of production query logs | estimate | synthetic | next_action |
| 0.86 confidence | estimate | synthetic | selection_rationale |
| 0.85 | estimate | synthetic | selection_rationale |
| >40% skew | threshold | synthetic | selection_rationale |
| >60% worker saturation | threshold | synthetic | selection_rationale |
| b003 | estimate | synthetic | selection_rationale |

Unknowns blocking a firmer verdict

  • Coordinator bottleneck at 2,000 tenants: killed branch b005 cited case studies from Framer and Heap showing coordinator hotspotting spiking p99 to 150ms+. This was auto-pruned as unsupported but the concern is architecturally valid and untested in this specific workload profile.
  • Cross-cloud migration complexity: if existing services are on AWS, moving the database to Azure introduces cross-cloud latency and data transfer costs not accounted for in the $4,200/month estimate.
  • The $4,200/month Azure cost and $28K/month DynamoDB cost are model-generated projections without cited production benchmarks for this specific workload volume.
  • No evidence that the current DynamoDB bottleneck has been formally diagnosed — b002/b007's concern that the problem may be implementation rather than technology remains valid.
  • Actual query patterns and data volume per tenant not specified — latency projections assume typical multi-tenant SaaS workloads.

Operational signals to watch

reversal — Candidate estimate (inferred, not source-confirmed): DynamoDB costs are primarily driven by implementation issues (poor partition key design, over-provisioned capacity) and a 30%+ cost reduction is achievable through optimization alone
reversal — Candidate estimate (inferred, not source-confirmed): Proof-of-concept shows coordinator bottleneck at 2,000 tenants causes p99 > 50ms under production-equivalent concurrent load
reversal — Candidate estimate (inferred, not source-confirmed): Existing infrastructure is entirely AWS-native and cross-cloud latency to Azure adds >10ms to p99, eating the safety margin

Branch battle map

[Battle map chart: rounds R1-R3 plus a censor reopen, across branches b001-b007]
Battle timeline (3 rounds)
Round 1 — Initial positions · 3 branches
Branch b001 (Vulcan) eliminated — Branch b001 proposes a hybrid DynamoDB + PostgreSQL/Citus...
Socrates proposed branch b004
Socrates Reframe the problem: Instead of asking whether to migrate from DynamoDB to Postg…
Round 2 — Adversarial probes · 3 branches
Loki proposed branch b005
Branch b005 (Loki) eliminated — auto-pruned: unsupported low-confidence branch
Socrates proposed branch b006
Branch b006 (Socrates) eliminated — auto-pruned: unsupported low-confidence branch
Loki Azure Cosmos DB for PostgreSQL (Hyperscale Citus) recommendation ignores real-wo…
Socrates Instead of a simple yes/no migration decision, we should evaluate whether a hybr…
Round 3 — Final convergence · 3 branches
Branch b004 (Socrates) eliminated — Branch b004 proposes a 'polyglot persistence strategy' wi...
Socrates proposed branch b007
Socrates Before considering migration, conduct a comprehensive database implementation au…