Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?
This verdict relies on assumed default values for 40% of its constraints.
The following constraints were not provided and default values were used:
- current_scale: moderate scale assumed (not_addressed)
- existing_stack: greenfield assumed (not_addressed)
- connection_pooler: not specified (not_addressed)
- data_volume: not specified (not_addressed)
- traffic_shape: not specified (not_addressed)
- current_bottleneck: not specified (not_addressed)
Decision
Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL): 1 coordinator (8 vCores, 32GB RAM) + 4 worker nodes (4 vCores, 32GB RAM each). Use tenant_id as the distribution column with co-location. Estimated cost ~$4,200/month vs $28K+/month for DynamoDB. Single-tenant queries (90%+ of workload) route to a single shard at 5-15ms p99; cross-tenant JOINs hit 20-45ms p99, meeting the 50ms target with ~10% headroom.

Migration path: pgloader for the bulk load, AWS DMS for CDC during cutover, and a 2-week dual-write period with DynamoDB as read fallback via application-level routing.

Critical failure mode: hot tenant skew. If the top 3 tenants represent >40% of data/queries, isolate them onto dedicated worker nodes using Citus tenant isolation (shard_count=1 per large tenant). If skew exceeds 60% on any single worker, p99 will breach 50ms under concurrent load.

Self-managed Citus on AWS is rejected as a hidden budget killer: dual-running DynamoDB ($28K/month) plus self-managed Citus ($8K/month) plus engineering time blows the budget by month 4.
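A minimal sketch of the distribution scheme this decision implies, assuming hypothetical table names (events, orders), a hypothetical connection string, and a text tenant_id column; create_distributed_table, its colocate_with argument, and isolate_tenant_to_new_shard are standard Citus UDFs, the last implementing the tenant-isolation step for hot tenants:

```python
# Sketch only: table names, DSN, and the isolated tenant value are assumptions.
import psycopg2

conn = psycopg2.connect("host=coordinator.example dbname=app user=app")
conn.autocommit = True
cur = conn.cursor()

# Distribute the root table by tenant_id (hash distribution is the Citus default).
cur.execute("SELECT create_distributed_table('events', 'tenant_id');")

# Co-locate sibling tables on the same column so single-tenant JOINs stay
# shard-local: the property behind the 5-15ms p99 single-tenant projection.
cur.execute(
    "SELECT create_distributed_table('orders', 'tenant_id', colocate_with => 'events');"
)

# If a hot tenant crosses the >40% skew threshold, move it to its own shard
# (effectively shard_count=1 for that tenant), per the failure-mode plan above.
cur.execute(
    "SELECT isolate_tenant_to_new_shard('events', 'tenant-1234', cascade_option => 'CASCADE');"
)
```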
Next actions
- Deploy a 2-node Azure Cosmos DB for PostgreSQL (Hyperscale Citus) proof-of-concept cluster with 1 coordinator + 1 worker node, load 3 representative tenants (including the largest by data volume), distribute on tenant_id, replay 24 hours of production query logs via pgbench, and measure p99 latency against the 50ms target before committing to full migration.
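pgbench with a custom script file is the stated replay tool; as an illustrative alternative, a minimal harness like the following (hypothetical DSN and log file, assuming one logged SELECT per line) computes the p99 that the go/no-go check needs:

```python
# Sketch only: replays logged SELECTs sequentially and reports nearest-rank p99.
import math
import time
import psycopg2

def p99_ms(samples):
    """Nearest-rank 99th percentile of a list of latencies in milliseconds."""
    ranked = sorted(samples)
    return ranked[math.ceil(0.99 * len(ranked)) - 1]

conn = psycopg2.connect("host=poc-coordinator.example dbname=app user=app")
cur = conn.cursor()

latencies = []
with open("queries.sql") as log:  # hypothetical: one logged query per line
    for query in log:
        start = time.perf_counter()
        cur.execute(query)
        cur.fetchall()
        latencies.append((time.perf_counter() - start) * 1000)

print(f"p99 = {p99_ms(latencies):.1f} ms (target: < 50 ms)")
```

A single-threaded replay understates contention; the pgbench run with multiple concurrent clients remains the authoritative check for p99 under load.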
Council notes
- Selected: b003 (0.86 confidence) narrowly exceeded b002 (0.85). It names specific node configurations, cost projections, migration tooling (pgloader, AWS DMS), concrete failure modes with quantified thresholds (>40% skew, >60% worker saturation), and an actionable architecture. b002 provides a sound decision framework but lacks architectural specificity: it says 'if analysis confirms, then migrate' without detailing what the migration looks like. b003 survived 3 rounds of adversarial strengthening and provides the most execution-ready path.
- Rejected: conduct a comprehensive performance audit before any migration decision (b002/b007). Both propose analysis-first approaches. While sound in principle, they lack architectural specificity: b002 sets a 45ms latency target and a 30% cost-reduction threshold but names no specific technology, node configuration, or migration tooling; b007 is even more abstract, proposing an audit without any concrete migration architecture. The question implies DynamoDB cost/complexity is already an identified problem (motivating the migration question). b003 subsumes the valid concern by specifying a 2-week dual-write validation period while providing a fully actionable architecture.
- Killed: hybrid DynamoDB + PostgreSQL/Citus architecture (b001). It doubles operational surface area without eliminating DynamoDB's read-unit costs (typically 70%+ of the bill), application-level joins between DynamoDB and Postgres entities at 2,000 tenants would blow past 50ms p99, and no concrete cost numbers or workload split were provided.
- Killed: a further branch that named zero specific technologies, databases, or thresholds, and proposed redefining SLAs when the 50ms p99 SLA is already specified. Structurally hollow.
Evidence boundary
Observed from your filing
- Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?
Assumptions used for analysis
- DynamoDB cost (~$28K/month) is the primary driver for migration, not a misidentified implementation issue
- The existing SaaS can tolerate a cross-cloud database dependency on Azure if other services remain on AWS
- 90%+ of queries are single-tenant scoped (tenant_id filtered), making shard-local routing the dominant access pattern
- The engineering team has sufficient PostgreSQL operational expertise to manage the migration and ongoing operations even with managed Citus
- The 2-week dual-write cutover window is achievable given schema complexity and data volume across 2,000 tenants (see the routing sketch after this list)
- current scale defaulted: moderate scale assumed (not_addressed)
- existing stack defaulted: greenfield assumed (not_addressed)
- connection pooler defaulted: not specified (not_addressed)
- data volume defaulted: not specified (not_addressed)
- traffic shape defaulted: not specified (not_addressed)
- current bottleneck defaulted: not specified (not_addressed)
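A hedged sketch of the application-level routing assumed during the 2-week dual-write window referenced above; the table name, key shape, and connection details are hypothetical:

```python
# Sketch only: dual writes with Postgres as the new system of record and
# DynamoDB kept warm as the read fallback until cutover completes.
import boto3
import psycopg2

ddb = boto3.resource("dynamodb").Table("events")  # hypothetical DynamoDB table
pg = psycopg2.connect("host=coordinator.example dbname=app user=app")

def write_event(tenant_id: str, event_id: str, payload: str) -> None:
    with pg, pg.cursor() as cur:  # commits the Postgres write on success
        cur.execute(
            "INSERT INTO events (tenant_id, event_id, payload) VALUES (%s, %s, %s)",
            (tenant_id, event_id, payload),
        )
    ddb.put_item(Item={"tenant_id": tenant_id, "event_id": event_id, "payload": payload})

def read_event(tenant_id: str, event_id: str):
    with pg, pg.cursor() as cur:
        cur.execute(
            "SELECT payload FROM events WHERE tenant_id = %s AND event_id = %s",
            (tenant_id, event_id),
        )
        row = cur.fetchone()
    if row:
        return row[0]
    # Fallback read: the row has not been migrated yet, so serve it from DynamoDB.
    item = ddb.get_item(Key={"tenant_id": tenant_id, "event_id": event_id})
    return item.get("Item", {}).get("payload")
```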
Inferred specifics table
| Value | Kind | Basis | Where introduced |
|---|---|---|---|
| shard_count=1 | config | synthetic | chosen_path |
| 1 coordinator | estimate | synthetic | chosen_path |
| 8 vCores | estimate | synthetic | chosen_path |
| 32GB RAM | estimate | synthetic | chosen_path |
| 4 worker nodes | estimate | synthetic | chosen_path |
| ~$4,200/month vs $28K+/month DynamoDB | estimate | synthetic | chosen_path |
| 90%+ of workload | threshold | synthetic | chosen_path |
| route to a single shard at 5-15ms p99 | threshold | synthetic | chosen_path |
| cross-tenant JOINs hit 20-45ms p99 | threshold | synthetic | chosen_path |
| meeting the 50ms target with ~10% headroom | threshold | synthetic | chosen_path |
| tenants represent >40% of data/queries | threshold | synthetic | chosen_path |
| If skew exceeds 60% on any single worker | threshold | synthetic | chosen_path |
| cluster with 1 coordinator + 1 worker | estimate | synthetic | next_action |
| load 3 representative tenants | estimate | synthetic | next_action |
| replay 24 hours of production query logs | estimate | synthetic | next_action |
| 0.86 confidence | estimate | synthetic | selection_rationale |
| 0.85 | estimate | synthetic | selection_rationale |
| >40% skew | threshold | synthetic | selection_rationale |
| >60% worker saturation | threshold | synthetic | selection_rationale |
| b003 | estimate | synthetic | selection_rationale |
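The >40% and >60% skew thresholds above are checkable. A hypothetical per-worker check against the standard citus_shards metadata view might look like this (table name and DSN assumed):

```python
# Sketch only: compares each worker's share of a distributed table's bytes
# against the skew thresholds from the decision (>40% isolate, >60% breach risk).
import psycopg2

conn = psycopg2.connect("host=coordinator.example dbname=app user=app")
cur = conn.cursor()
cur.execute(
    """
    SELECT nodename, sum(shard_size) AS bytes
    FROM citus_shards
    WHERE table_name = 'events'::regclass
    GROUP BY nodename
    """
)
rows = cur.fetchall()
total = sum(float(b) for _, b in rows) or 1.0
for node, b in rows:
    share = float(b) / total
    status = "breach risk" if share > 0.60 else "isolate tenant" if share > 0.40 else "ok"
    print(f"{node}: {share:.0%} of table bytes ({status})")
```

Byte skew is only a proxy for query skew, so pair this with per-tenant query statistics before isolating a tenant.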
Unknowns blocking a firmer verdict
- Coordinator bottleneck at 2,000 tenants: killed branch b005 cited case studies from Framer and Heap showing coordinator hotspotting spiking p99 to 150ms+. It was auto-pruned as unsupported, but the concern is architecturally valid and untested for this specific workload profile (see the measurement sketch after this list).
- Cross-cloud migration complexity: if existing services are on AWS, moving the database to Azure introduces cross-cloud latency and data transfer costs not accounted for in the $4,200/month estimate.
- The $4,200/month Azure cost and $28K/month DynamoDB cost are model-generated projections without cited production benchmarks for this specific workload volume.
- No evidence that the current DynamoDB bottleneck has been formally diagnosed — b002/b007's concern that the problem may be implementation rather than technology remains valid.
- Actual query patterns and data volume per tenant not specified — latency projections assume typical multi-tenant SaaS workloads.
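On the coordinator-bottleneck unknown above, one way to test the 90%+ single-tenant assumption is citus_stat_statements, which is available on the managed service when pg_stat_statements is enabled; its partition_key column is NULL for queries that could not be scoped to a single tenant and therefore fan out through the coordinator. A hedged sketch (hypothetical DSN):

```python
# Sketch only: measures the single-tenant vs cross-tenant call split that the
# 5-15ms / 20-45ms p99 projections assume.
import psycopg2

conn = psycopg2.connect("host=coordinator.example dbname=app user=app")
cur = conn.cursor()
cur.execute(
    """
    SELECT
      sum(calls) FILTER (WHERE partition_key IS NOT NULL) AS single_tenant,
      sum(calls) FILTER (WHERE partition_key IS NULL)     AS cross_tenant
    FROM citus_stat_statements
    """
)
single, cross = cur.fetchone()
total = (single or 0) + (cross or 0) or 1
print(f"single-tenant calls: {100 * (single or 0) / total:.1f}% (assumption: >= 90%)")
```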