Step-by-Step Redis Cluster Slot Migration Guide
Redis Cluster partitions the keyspace across 16,384 deterministic hash slots. Horizontal scaling requires redistributing these slots across newly provisioned or underutilized nodes without disrupting live traffic. Because the architecture relies on a decentralized gossip protocol and strict key hashing, any misalignment during the handoff triggers MOVED or ASK redirect storms that can exhaust connection pools and degrade application latency. Executing a Zero-Downtime Slot Migration requires treating the operation as a stateful, multi-phase transition rather than a bulk data transfer. The following guide details production-tested diagnostics, CLI orchestration, client-side resilience patterns, and CI/CD gating for Redis 6.2+ and 7.x environments.
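To make the "deterministic hash slot" idea concrete, the sketch below computes the slot for a key the way the cluster specification describes it: CRC16 (XMODEM variant) of the key, or of its hash tag when one is present, modulo 16,384. The function name `key_slot` is illustrative only; in practice `CLUSTER KEYSLOT <key>` gives the authoritative answer.

```python
import binascii

SLOT_COUNT = 16384


def key_slot(key: str) -> int:
    """Return the cluster hash slot for a key, honoring {hash tags}."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' is hashed.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16 XMODEM variant.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


print(key_slot("foo"))  # 12182
# Keys that share a hash tag land in the same slot and therefore migrate together.
print(key_slot("user:{42}:profile") == key_slot("user:{42}:cart"))  # True
```

Because keys with the same hash tag always share a slot, multi-key operations on them keep working before, during, and after a migration of that slot.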
Pre-Migration Diagnostics and Baseline Validation
Before initiating any topology change, establish a deterministic health baseline. Uneven memory distribution, CPU saturation, or orphaned slots will amplify migration latency and increase the probability of partial failures.
1. Verify Slot Ownership and Topology Consistency
redis-cli -h <any-master> -p <port> --cluster check <any-master>:<port>
redis-cli -h <any-master> -p <port> CLUSTER SLOTS
redis-cli -h <any-master> -p <port> CLUSTER NODES
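Cross-reference the output to confirm that no slot is marked ERR or claimed by more than one master. Each master should advertise its own configuration epoch (configEpoch); two masters sharing an epoch, or nodes that disagree about the owner of a slot, point to stale gossip state. Redis normally resolves epoch collisions on its own. The command below only succeeds on a fresh node whose node table is empty and whose config epoch is still zero, so it applies when a newly provisioned node needs its epoch assigned by hand: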
redis-cli -h <affected-node> -p <port> CLUSTER SET-CONFIG-EPOCH <new-epoch>
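The ownership cross-check can also be scripted. The sketch below is one possible approach, not a replacement for `redis-cli --cluster check`: it assumes you have captured the raw `CLUSTER NODES` output from each master, reads only the line flagged `myself` in each capture (the slots that node itself claims), and reports slots that nobody claims or that more than one node claims. The helper names `own_slots` and `audit_slots` are hypothetical.

```python
SLOT_COUNT = 16384


def own_slots(cluster_nodes_text: str) -> tuple[str, set[int]]:
    """Return (node_id, slots) claimed on the 'myself' line of one node's output."""
    for line in cluster_nodes_text.splitlines():
        fields = line.split()
        if len(fields) < 8 or "myself" not in fields[2].split(","):
            continue
        slots: set[int] = set()
        for token in fields[8:]:
            if token.startswith("["):
                continue  # [slot->-id] and [slot-<-id] mark in-flight migrations
            first, _, last = token.partition("-")
            slots.update(range(int(first), int(last or first) + 1))
        return fields[0], slots
    raise ValueError("no line flagged 'myself' in CLUSTER NODES output")


def audit_slots(views: list[str]) -> tuple[list[int], list[int]]:
    """Return (unclaimed slots, slots claimed by more than one node)."""
    claims: dict[int, list[str]] = {}
    for text in views:
        node_id, slots = own_slots(text)
        for slot in slots:
            claims.setdefault(slot, []).append(node_id)
    missing = [s for s in range(SLOT_COUNT) if s not in claims]
    contested = sorted(s for s, ids in claims.items() if len(ids) > 1)
    return missing, contested
```

Both returned lists should be empty before a migration starts; a non-empty result is a reason to stop and investigate rather than to reshard.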
2. Validate Critical Cluster Parameters
Ensure the following are explicitly configured in redis.conf or via CONFIG SET:
- cluster-node-timeout 15000: prevents premature failover during high-latency MIGRATE operations
- cluster-require-full-coverage no: allows the cluster to continue serving unaffected slots if a migration temporarily blocks a subset
- cluster-migration-barrier 1: keeps at least one replica attached to each master before a spare replica is allowed to migrate to an orphaned master
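A small drift check makes this validation repeatable. The sketch below assumes the live settings are available as a plain mapping of parameter name to string value (for example, the dictionary that redis-py's `config_get` returns for a pattern such as `cluster-*`); the `EXPECTED` table and the `config_drift` helper are illustrative names, and the expected values are the ones listed above.

```python
EXPECTED = {
    "cluster-node-timeout": "15000",
    "cluster-require-full-coverage": "no",
    "cluster-migration-barrier": "1",
}


def config_drift(actual: dict, expected: dict = EXPECTED) -> dict:
    """Return {parameter: (expected, actual)} for every value that differs."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if str(actual.get(name)) != want
    }


# Example with a node that still enforces full coverage.
live = {
    "cluster-node-timeout": "15000",
    "cluster-require-full-coverage": "yes",
    "cluster-migration-barrier": "1",
}
print(config_drift(live))
```

Running the check against every node, not just one, catches the case where a single node was restarted with an older redis.conf.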
3. Identify Hot-Key Skew
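Run redis-cli --bigkeys to find the largest key of each type, and sample MEMORY USAGE on high-throughput keys. Note that --bigkeys measures size, not access frequency; use redis-cli --hotkeys (which requires an LFU maxmemory-policy) when request skew is the concern. If a single slot contains more than 15% of a node's memory footprint, split the migration into smaller batches, because MIGRATE blocks the source node while each batch of keys is serialized and transferred.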
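The 15% rule of thumb can be applied mechanically once you have size samples. The sketch below assumes you have already collected `(slot, bytes)` pairs, for instance by pairing `CLUSTER KEYSLOT` with `MEMORY USAGE` for each sampled key, and that `node_used_bytes` comes from the node's reported memory usage. The function name `oversized_slots` is hypothetical, and sampled data will underestimate the true per-slot totals.

```python
from collections import Counter
from typing import Iterable


def oversized_slots(
    samples: Iterable[tuple[int, int]],
    node_used_bytes: int,
    threshold: float = 0.15,
) -> dict[int, float]:
    """Return {slot: share of node memory} for slots above the threshold.

    samples holds (slot, bytes) pairs, one per sampled key.
    """
    if node_used_bytes <= 0:
        raise ValueError("node_used_bytes must be positive")
    per_slot: Counter = Counter()
    for slot, size in samples:
        per_slot[slot] += size
    shares = {slot: total / node_used_bytes for slot, total in sorted(per_slot.items())}
    return {slot: share for slot, share in shares.items() if share > threshold}
```

Slots returned here are candidates for migrating on their own, in a quiet period, rather than inside a larger batch.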
Orchestrating the Slot Handoff
Slot migration is governed by three cluster states: MIGRATING (source), IMPORTING (target), and the post-handoff stable state. The redis-cli --cluster reshard utility automates the underlying CLUSTER SETSLOT and MIGRATE commands.
Incremental Migration Pattern
Never migrate more than 500 slots in a single execution for datasets exceeding 10 GB. Use the following sequence:
redis-cli --cluster reshard <target-host>:<target-port> \
--cluster-from <source-node-id> \
--cluster-to <target-node-id> \
--cluster-slots 200 \
--cluster-yes
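When a rebalance involves more slots than a single batch should carry, the batches can be planned up front. The sketch below only builds the argument lists for the reshard command shown above, one per batch; the helper name `reshard_commands` and the default batch size of 200 are assumptions taken from this guide's example, not limits enforced by redis-cli.

```python
def reshard_commands(
    entry_node: str,
    source_id: str,
    target_id: str,
    total_slots: int,
    batch_size: int = 200,
) -> list[list[str]]:
    """Build one redis-cli argv list per batch of at most batch_size slots."""
    if total_slots <= 0 or batch_size <= 0:
        raise ValueError("total_slots and batch_size must be positive")
    commands = []
    remaining = total_slots
    while remaining > 0:
        count = min(batch_size, remaining)
        commands.append([
            "redis-cli", "--cluster", "reshard", entry_node,
            "--cluster-from", source_id,
            "--cluster-to", target_id,
            "--cluster-slots", str(count),
            "--cluster-yes",
        ])
        remaining -= count
    return commands


for argv in reshard_commands("10.0.0.9:6379", "<source-node-id>", "<target-node-id>", 450):
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)`; running `redis-cli --cluster check` between batches keeps a failed batch from being compounded by the next one.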
During execution, the source node transitions to MIGRATING. Queries for keys in the target range that have already moved return ASK <slot> <target-host>:<target-port>. The client must follow the ASK redirect, issue an ASKING command, and retry the original operation. Once all keys are transferred, redis-cli --cluster reshard issues CLUSTER SETSLOT <slot> NODE <dest_node_id> to both the destination and source nodes to assign the slot to its new owner, and gossip propagates the updated ownership map.
CLUSTER SETSLOT <slot> STABLE only clears the transient MIGRATING/IMPORTING state. It does not transfer ownership and should only be used to abort a stalled migration.
sequenceDiagram
participant C as Client
participant Src as Source node
participant Dst as Target node
C->>Src: GET key (slot migrating)
Src-->>C: ASK slot dst
C->>Dst: ASKING
C->>Dst: GET key
Dst-->>C: value
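For a custom routing layer, the first step in the exchange above is recognizing the redirect. The sketch below parses the error text of an ASK or MOVED reply into its parts; the `Redirect` type and `parse_redirect` function are illustrative names, and the input is assumed to be the error line with or without the leading `-` of the wire protocol.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "ASK" or "MOVED"
    slot: int
    host: str
    port: int


def parse_redirect(error_line: str) -> Optional[Redirect]:
    """Parse 'ASK <slot> <host>:<port>' or 'MOVED <slot> <host>:<port>'."""
    parts = error_line.lstrip("-").split()
    if len(parts) != 3 or parts[0] not in ("ASK", "MOVED"):
        return None
    # rpartition keeps IPv6 hosts intact by splitting on the last colon only.
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


print(parse_redirect("ASK 3999 127.0.0.1:6381"))
```

The two kinds call for different handling: an ASK applies to the next command only, so the client sends ASKING to the target and leaves its slot map unchanged, whereas a MOVED means ownership has changed and the cached slot map should be refreshed.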
When ACLs are enforced, pass AUTH2 <username> <password> to MIGRATE; the option has been available since Redis 6.0, so it applies to both 6.2+ and 7.x environments. Inspect MIGRATE return values alongside redis-cli --cluster check output to detect network timeouts or key serialization errors.
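For readers who need to drive a single slot by hand, the sketch below lays out the command order that the reshard utility automates, as plain argument lists. It is a simplification under stated assumptions: the `source` and `target` dictionaries (with `id`, `host`, and `port` keys) and the function name are hypothetical, and `keys` stands in for the key names that would be fetched from the source with repeated `CLUSTER GETKEYSINSLOT <slot> <count>` calls until the slot is empty.

```python
from typing import Optional, Sequence


def slot_migration_plan(
    slot: int,
    source: dict,
    target: dict,
    keys: Sequence[str],
    timeout_ms: int = 5000,
    auth2: Optional[tuple[str, str]] = None,
) -> list[tuple[str, list[str]]]:
    """Return (node, argv) pairs in execution order for moving one slot."""
    migrate = ["MIGRATE", target["host"], str(target["port"]), "", "0", str(timeout_ms)]
    if auth2:
        migrate += ["AUTH2", *auth2]  # username, password for ACL-enabled targets
    migrate += ["KEYS", *keys]
    set_owner = ["CLUSTER", "SETSLOT", str(slot), "NODE", target["id"]]
    plan = [
        ("target", ["CLUSTER", "SETSLOT", str(slot), "IMPORTING", source["id"]]),
        ("source", ["CLUSTER", "SETSLOT", str(slot), "MIGRATING", target["id"]]),
    ]
    if keys:
        plan.append(("source", migrate))
    # Ownership is assigned only after every key has left the source.
    plan += [("target", set_owner), ("source", set_owner)]
    return plan
```

If a step fails partway through, `CLUSTER SETSLOT <slot> STABLE` on both nodes clears the transient state without changing ownership, which leaves the slot with whichever node owned it before the attempt.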
Client-Side Resilience and Python Retry Patterns
Native cluster clients handle MOVED redirects automatically, but ASK redirects also require the ASKING prefix before the retried command. redis-py handles both automatically via RedisCluster. The code below illustrates the underlying mechanics for custom routing layers:
import redis
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


class RedisClusterMigrationClient:
    def __init__(self, nodes):
        # nodes: list of redis.cluster.ClusterNode instances used for discovery
        self.client = redis.RedisCluster(startup_nodes=nodes, decode_responses=True)

    @retry(
        stop=stop_after_attempt(5),
        wait=wait_exponential(multiplier=0.1, min=0.1, max=2),
        retry=retry_if_exception_type((redis.exceptions.AskError, redis.exceptions.ConnectionError)),
        reraise=True,
    )
    def safe_set(self, key: str, value: str) -> bool:
        try:
            return self.client.set(key, value)
        except redis.exceptions.AskError as e:
            # AskError parses "<slot> <host>:<port>" into e.host and e.port
            ask_conn = redis.Redis(host=e.host, port=e.port, decode_responses=True)
            try:
                # ASKING only applies to the next command on the same connection,
                # so send both through one non-transactional pipeline.
                pipe = ask_conn.pipeline(transaction=False)
                pipe.execute_command("ASKING")
                pipe.set(key, value)
                return pipe.execute()[-1]
            finally:
                ask_conn.close()
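It helps to know how much latency the retry policy above can add in the worst case. The sketch below reproduces the wait calculation under the assumption that tenacity's `wait_exponential` computes `multiplier * 2 ** (attempt - 1)` and clamps the result between `min` and `max`; the helper name `backoff_schedule` is illustrative and the function does not call tenacity itself.

```python
def backoff_schedule(
    attempts: int = 5,
    multiplier: float = 0.1,
    min_wait: float = 0.1,
    max_wait: float = 2.0,
    exp_base: int = 2,
) -> list[float]:
    """Return the waits, in seconds, between consecutive attempts."""
    waits = []
    for attempt in range(1, attempts):  # no wait follows the final attempt
        raw = multiplier * exp_base ** (attempt - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


schedule = backoff_schedule()
print(schedule)  # [0.1, 0.2, 0.4, 0.8]
print(round(sum(schedule), 3))  # 1.5
```

With these settings a call that fails all five attempts spends about 1.5 seconds waiting, in addition to the time each attempt takes, which is worth comparing against the caller's own request deadline.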
For high-throughput services, configure redis-py with retry_on_timeout=True and socket_timeout=2.0 to prevent thread pool starvation during gossip propagation delays. Refer to the official redis-py documentation for cluster client initialization best practices.
CI/CD Gating and Pipeline Automation
Automate migration safety checks to prevent topology corruption during automated deployments.
jobs:
  redis-slot-migration:
    runs-on: ubuntu-latest
    steps:
      - name: Pre-flight Cluster Health Check
        run: |
          # REDIS_ENDPOINT is expected in host:port form
          redis-cli --cluster check "$REDIS_ENDPOINT" | grep -q "All 16384 slots covered" || exit 1
          redis-cli -h "${REDIS_ENDPOINT%:*}" -p "${REDIS_ENDPOINT##*:}" CLUSTER INFO | grep -q "cluster_state:ok" || exit 1
      - name: Canary Slot Redistribution (50 slots)
        run: |
          redis-cli --cluster reshard "$TARGET_NODE" \
            --cluster-from "$SOURCE_ID" \
            --cluster-to "$TARGET_ID" \
            --cluster-slots 50 \
            --cluster-yes \
            --cluster-timeout 10000
      - name: Post-Migration Validation Gate
        run: |
          sleep 30  # Allow gossip convergence
          redis-cli --cluster check "$REDIS_ENDPOINT" > cluster_report.txt
          grep -q "\[OK\] All 16384 slots covered" cluster_report.txt || {
            echo "FAIL: slot coverage gap"; exit 1;
          }
          if grep "\[ERR\]" cluster_report.txt; then
            echo "FAIL: cluster check reported errors"; exit 1
          fi
          echo "PASS: Migration complete"
Enforce mandatory manual approval gates for migrations exceeding 1,000 slots. Integrate Prometheus redis_cluster_slots_assigned and redis_cluster_slots_ok metrics into your deployment dashboard to trigger automated rollbacks if slot coverage drops below 100% for more than 60 seconds.
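The 60-second rollback rule can be expressed as a small decision function over scraped samples. The sketch below assumes a time-ordered series of `(unix_timestamp, slots_ok)` pairs, such as values read from the `redis_cluster_slots_ok` metric; the function name `should_roll_back` and its parameters are hypothetical, and a real alerting rule would normally live in the monitoring system itself.

```python
from typing import Iterable


def should_roll_back(
    samples: Iterable[tuple[float, int]],
    grace_seconds: float = 60,
    total_slots: int = 16384,
) -> bool:
    """Return True if coverage stays below total_slots for more than grace_seconds."""
    degraded_since = None
    for timestamp, slots_ok in samples:
        if slots_ok >= total_slots:
            degraded_since = None  # coverage recovered; reset the timer
            continue
        if degraded_since is None:
            degraded_since = timestamp
        if timestamp - degraded_since > grace_seconds:
            return True
    return False
```

Resetting the timer on every healthy sample means brief dips during a handoff do not accumulate into a false rollback.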
Post-Migration Verification and Telemetry
After the final slot batch is committed, validate topology consistency and monitor convergence metrics:
redis-cli --cluster check <any-node>:<port>
redis-cli -h <any-node> -p <port> CLUSTER NODES | grep "master" | awk '{printf "%s", $2; for (i = 9; i <= NF; i++) printf " %s", $i; print ""}' | sort
cluster_stats_messages_received and cluster_stats_messages_sent are cumulative counters, so watch their rate of increase: it should settle back to its pre-migration level within 5–10 seconds, indicating gossip convergence. Watch cluster_slots_pfail and cluster_slots_fail in CLUSTER INFO and re-run redis-cli --cluster check to surface persistent network bottlenecks or stalled slots. For teams scaling beyond single-region deployments, review the broader architectural patterns documented in Redis Cluster Scaling, Sharding & Automation to align slot distribution with cross-AZ latency requirements.
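These CLUSTER INFO checks are easy to script from raw command output. The sketch below is a minimal parser plus two helpers, assuming the text comes straight from `CLUSTER INFO` (one `field:value` pair per line); the function names are illustrative, and the message rate is computed from two snapshots taken a known number of seconds apart.

```python
def parse_cluster_info(text: str) -> dict:
    """Parse 'field:value' lines, converting numeric values to int."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = int(value) if value.isdigit() else value
    return info


def cluster_problems(info: dict, total_slots: int = 16384) -> list[str]:
    """List the fields that deviate from a fully healthy cluster."""
    problems = []
    if info.get("cluster_state") != "ok":
        problems.append(f"cluster_state={info.get('cluster_state')}")
    for field in ("cluster_slots_assigned", "cluster_slots_ok"):
        if info.get(field) != total_slots:
            problems.append(f"{field}={info.get(field)}")
    for field in ("cluster_slots_pfail", "cluster_slots_fail"):
        if info.get(field, 0) != 0:
            problems.append(f"{field}={info[field]}")
    return problems


def message_rate(before: dict, after: dict, seconds: float) -> float:
    """Gossip messages per second (sent + received) between two snapshots."""
    fields = ("cluster_stats_messages_sent", "cluster_stats_messages_received")
    return sum(after[f] - before[f] for f in fields) / seconds
```

Comparing the message rate against a value recorded before the migration gives a concrete meaning to "converged" for your particular cluster size.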
Maintain a 24-hour observation window before decommissioning legacy nodes. Monitor connection pool utilization, the latency_percentiles_usec_<command> fields in INFO latencystats (Redis 7.0+), latency spikes recorded by LATENCY HISTORY, and replica sync lag (master_repl_offset delta) to ensure the cluster has fully absorbed the new topology.
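Replica sync lag can be derived from a master's `INFO replication` output. The sketch below assumes the usual layout, with a `master_repl_offset:<n>` line and one `slaveN:ip=...,port=...,state=...,offset=...,lag=...` line per connected replica, and returns how many bytes each replica is behind; the function name `replica_bytes_behind` is hypothetical.

```python
def replica_bytes_behind(info_replication_text: str) -> dict[str, int]:
    """Return {replica address: master_repl_offset - replica offset}."""
    master_offset = None
    replica_offsets = {}
    for raw in info_replication_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and key[5:].isdigit():
            # slaveN lines hold comma-separated name=value pairs.
            fields = dict(item.split("=", 1) for item in value.split(","))
            address = f"{fields['ip']}:{fields['port']}"
            replica_offsets[address] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found")
    return {addr: master_offset - off for addr, off in replica_offsets.items()}
```

A delta that keeps growing across successive samples, rather than hovering near zero, suggests the replica cannot keep up with the write load on its newly enlarged master.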