Should we replace Redis with Valkey now that Redis changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?
This verdict assumes 50% of constraints
The following constraints were not provided and default values were used:
- team_size: standard team (5-10 engineers) (not_addressed)
- existing_stack: greenfield assumed (not_addressed)
- connection_pooler: not specified (not_addressed)
- current_state: not specified (not_addressed)
- rollback_plan: not specified (not_addressed)
- data_volume: not specified (not_addressed)
Execute a phased canary migration from Redis to Valkey 7.2.x using a dual-write proxy pattern over 4 months
Decision
Execute a phased canary migration from Redis to Valkey 7.2.x over 4 months using a dual-write proxy pattern (Envoy with redis_proxy filter or Twemproxy).
- Phase 1: Stand up a 20-node Valkey canary (10% of fleet) receiving shadow writes while Redis serves all reads.
- Phase 2: Shift reads for the session cache workload to the Valkey canary, validating p99 ≤2ms and cache hit ratio ≥85%.
- Phase 3: Expand to 100 Valkey nodes at 50% traffic.
- Phase 4: Full 200-node cutover with Redis kept warm for a 2-week rollback window.
Abort if any of the following occurs: Valkey p99 exceeds 3ms, cluster gossip exceeds 100 Mbps, pub/sub latency exceeds 5ms, or more than 2 nodes fail in any 7-day canary window.
Key failure mode: pub/sub at 200 nodes broadcasts to all cluster members, so if real-time events exceed 100K messages/sec, internal bandwidth saturates. Mitigation: isolate pub/sub onto a dedicated 16-node cluster.
Second failure mode: cluster rebalancing storms as the 16,384 hash slots are reassigned during node topology changes.
Budget: $50K total. This avoids the $400K-$600K/year Redis Enterprise licensing cost and the security risk of staying on Redis 7.2 (the last BSD-licensed version) as patches shift to 7.4+.
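The four abort conditions in the decision can be expressed as a small check over canary metrics. The following is a minimal sketch, not part of the filing: the `CanaryMetrics` type, its field names, and the `abort_reasons` function are hypothetical, and only the numeric limits come from the decision text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryMetrics:
    """Hypothetical snapshot of the metrics named in the abort criteria."""
    p99_latency_ms: float      # Valkey command latency, 99th percentile
    gossip_mbps: float         # cluster gossip bandwidth
    pubsub_latency_ms: float   # pub/sub delivery latency
    node_failures_7d: int      # node failures in the trailing 7-day window


# Limits copied from the decision text.
P99_LIMIT_MS = 3.0
GOSSIP_LIMIT_MBPS = 100.0
PUBSUB_LIMIT_MS = 5.0
NODE_FAILURE_LIMIT = 2


def abort_reasons(m: CanaryMetrics) -> List[str]:
    """Return every abort condition the snapshot violates (empty list = continue)."""
    reasons = []
    if m.p99_latency_ms > P99_LIMIT_MS:
        reasons.append(f"p99 {m.p99_latency_ms}ms exceeds {P99_LIMIT_MS}ms")
    if m.gossip_mbps > GOSSIP_LIMIT_MBPS:
        reasons.append(f"gossip {m.gossip_mbps} Mbps exceeds {GOSSIP_LIMIT_MBPS} Mbps")
    if m.pubsub_latency_ms > PUBSUB_LIMIT_MS:
        reasons.append(f"pub/sub latency {m.pubsub_latency_ms}ms exceeds {PUBSUB_LIMIT_MS}ms")
    if m.node_failures_7d > NODE_FAILURE_LIMIT:
        reasons.append(f"{m.node_failures_7d} node failures exceeds {NODE_FAILURE_LIMIT} in 7 days")
    return reasons


# Example: a snapshot that trips the latency and node-failure conditions.
snapshot = CanaryMetrics(p99_latency_ms=3.4, gossip_mbps=60.0,
                         pubsub_latency_ms=2.1, node_failures_7d=3)
for reason in abort_reasons(snapshot):
    print("ABORT:", reason)
```

All comparisons are strict, matching the wording "exceeds" and "more than 2": a value exactly at a limit does not trigger an abort.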
Next actions
Council notes
Evidence boundary
Observed from your filing
- should we replace Redis with Valkey now that Redis changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?
Assumptions used for analysis
- Valkey 7.2.x is API-compatible with the Redis commands and data structures currently used across the 200-node deployment, and the deployment uses no custom Redis modules or RESP3-specific features that Valkey lacks
- The existing deployment runs Redis 7.2 or earlier (7.2 being the last BSD-licensed version) and has not yet upgraded to Redis 7.4+ under the new RSALv2/SSPLv1 dual license
- Cloud infrastructure can provision 20 additional nodes for canary without exceeding quota or budget approval timelines
- The 2M ops/sec workload is distributed across session cache, rate limiting, and pub/sub — not a single monolithic use case that cannot be decomposed for phased migration
- Envoy with redis_proxy filter can handle the dual-write throughput at the required proxy layer without becoming a bottleneck itself
- team size defaulted: standard team (5-10 engineers) (not_addressed)
- existing stack defaulted: greenfield assumed (not_addressed)
- connection pooler defaulted: not specified (not_addressed)
- current state defaulted: not specified (not_addressed)
- rollback plan defaulted: not specified (not_addressed)
- data volume defaulted: not specified (not_addressed)
Inferred candidate specifics
- Deploy a 20-node Valkey 7.2.6 canary cluster in the same availability zone as the existing Redis deployment, configure Envoy with redis_proxy filter for dual-write from 10% of the production write path, and instrument Prometheus/Grafana dashboards tracking p99 latency, gossip bandwidth, pub/sub delivery latency, and node failure rate against the four abort thresholds.
- b003 had the highest confidence (0.90) among surviving branches, survived 3 rounds of adversarial challenge including a direct attack on dual-write feasibility (b004, killed), and provided the most concrete architecture: named proxy technology (Envoy redis_proxy), specific phase timeline, quantified abort thresholds, named failure modes with mitigations, and a budget breakdown. b002 (0.70) was a strictly weaker version of the same recommendation without the specificity.
- Rejected alternative: Hybrid architecture with Valkey at edge and commercial caching (ElastiCache) for critical workloads. Reason: Architecturally incoherent, since ElastiCache is Redis/Valkey under the hood. Introduced cache coherence problems at 2M ops/sec without naming a consistency protocol. Claimed p99 of 1.5ms while adding a synchronization layer, violating basic latency math. Fabricated budget constraints.
- Rejected alternative: Treat as a legal/contractual issue, negotiate commercial Redis license before any migration. Reason: SSPL/RSAL is a blanket license change, not negotiable per-customer. Redis Enterprise for 200 nodes would cost $400K-$600K/year vs. $50K one-time migration. Backup options (KeyDB unmaintained since 2022, DragonflyDB uses BSL 1.1) have the same or worse license problems. Delay accumulates unpatched CVE exposure on Redis 7.2.
- Rejected alternative: Reject dual-write as introducing insurmountable consistency risks and >10ms p99 spikes
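The cost comparison used against the license-negotiation alternative ($50K one-time migration vs. $400K-$600K/year licensing) can be checked with quick arithmetic. This is a sketch under a simplifying assumption that is not in the filing: infrastructure run costs are treated as equal on both sides, so only the one-time migration budget and the avoided license fee are compared. The function name `payback_months` is illustrative.

```python
MIGRATION_COST_USD = 50_000                          # one-time budget from the decision
LICENSE_COST_USD_PER_YEAR = (400_000, 600_000)       # range quoted for Redis Enterprise


def payback_months(one_time_cost: float, annual_cost_avoided: float) -> float:
    """Months of avoided licensing needed to cover the one-time migration cost."""
    monthly_saving = annual_cost_avoided / 12
    return one_time_cost / monthly_saving


for annual in LICENSE_COST_USD_PER_YEAR:
    months = payback_months(MIGRATION_COST_USD, annual)
    print(f"${annual:,}/year avoided -> payback in {months:.1f} months")
```

On these figures the migration budget is recovered in roughly 1.0 to 1.5 months of avoided licensing, which is why the estimate's sensitivity to the $50K figure matters more than the exact license quote.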
Inferred specifics table
| Value | Kind | Basis | Where introduced |
|---|---|---|---|
| Valkey 7.2.x | version | synthetic | chosen_path |
| Redis 7.2 | version | synthetic | chosen_path |
| Redis 7.4+ | version | synthetic | chosen_path |
| Migration over 4 months using a dual-write proxy | estimate | synthetic | chosen_path |
| Phase 1: stand up a 20-node Valkey canary | estimate | synthetic | chosen_path |
| 10% of fleet | threshold | synthetic | chosen_path |
| 2 | estimate | synthetic | chosen_path |
| p99 ≤2ms and cache hit ratio ≥85% | threshold | synthetic | chosen_path |
| Phase 3: expand to 100 Valkey nodes | estimate | synthetic | chosen_path |
| 50% of traffic | threshold | synthetic | chosen_path |
| Abort if Valkey p99 exceeds 3ms | threshold | synthetic | chosen_path |
| Abort if pub/sub latency exceeds 5ms | threshold | synthetic | chosen_path |
| Real-time events exceed 100K messages/sec | estimate | synthetic | chosen_path |
| Mitigation: dedicated 16-node pub/sub cluster | estimate | synthetic | chosen_path |
| 16,384 hash slots during node topology changes | estimate | synthetic | chosen_path |
| Budget: $50K total | estimate | synthetic | chosen_path |
| $400K-$600K/year Redis Enterprise licensing cost | estimate | synthetic | chosen_path |
| Redis 7.2 as the last BSD-licensed version | estimate | synthetic | chosen_path |
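The 16,384 hash slots listed above are how a cluster maps each key to a node: the slot is the CRC16 (XMODEM variant) of the key modulo 16384, with `{...}` hash tags restricting which part of the key is hashed. The sketch below follows the published cluster key-distribution algorithm using Python's standard `binascii.crc_hqx`; the function name `key_hash_slot` and the even split across 200 primaries are illustrative assumptions, since the filing does not say how many of the 200 nodes are primaries.

```python
import binascii

HASH_SLOTS = 16384


def key_hash_slot(key: bytes) -> int:
    """Map a key to its cluster hash slot (CRC16-XMODEM mod 16384)."""
    # Hash tags: if the key contains "{...}" with non-empty content,
    # only the content between the first "{" and the next "}" is hashed.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


# Keys sharing a hash tag land in the same slot, so they move together
# when slots are reassigned during a topology change.
print(key_hash_slot(b"{session:42}:profile") == key_hash_slot(b"{session:42}:cart"))

# Assumption for illustration: all 200 nodes are primaries with an even split.
print(HASH_SLOTS / 200)  # average slots owned per node
```

With an even split, each node owns about 82 slots, so adding or removing a node reassigns slots in small units; the rebalancing risk named in the decision comes from many such reassignments overlapping.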
Unknowns blocking a firmer verdict
- Valkey 7.2.x cluster behavior at exactly 200 nodes is not widely benchmarked in public literature — the gossip bandwidth and rebalancing storm thresholds are engineering estimates, not production-validated numbers at this specific scale
- b003's budget of $50K is a rough estimate — actual costs depend heavily on cloud provider, instance types, and whether reserved/spot pricing is available for the canary phase
- The pub/sub 100K messages/sec threshold for bandwidth saturation is model-derived, not benchmarked against Valkey's specific cluster broadcast implementation
- Redis 7.2 security patch timeline is uncertain — Redis Ltd may continue critical CVE patches longer than expected, or may not
- b004 (killed) raised a valid concern about dual-write consistency during network partitions that b003 addresses only via abort thresholds, not via a formal consistency protocol
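Because the 100K messages/sec saturation threshold is model-derived, it helps to see the arithmetic behind it. The following back-of-the-envelope sketch assumes each published message is forwarded once to every other cluster node and ignores protocol header overhead; the 100-byte payload is a hypothetical figure, not a value from the filing, and `pubsub_fanout_mbps` is an illustrative name.

```python
def pubsub_fanout_mbps(messages_per_sec: int, payload_bytes: int, cluster_nodes: int) -> float:
    """Aggregate inter-node bandwidth (Mbps) if every message reaches every other node."""
    peers = cluster_nodes - 1                      # each message is forwarded to all peers
    bits_per_sec = messages_per_sec * payload_bytes * 8 * peers
    return bits_per_sec / 1_000_000


ASSUMED_PAYLOAD_BYTES = 100  # hypothetical average message size

# Full 200-node cluster vs. the dedicated 16-node pub/sub cluster from the mitigation.
print(pubsub_fanout_mbps(100_000, ASSUMED_PAYLOAD_BYTES, 200))
print(pubsub_fanout_mbps(100_000, ASSUMED_PAYLOAD_BYTES, 16))
```

Under these assumptions the fan-out scales linearly with node count, which is the reasoning behind isolating pub/sub on a smaller cluster. Real numbers would need to be measured against Valkey's actual broadcast implementation, as the unknowns above note.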