Step-by-Step Redis Cluster Slot Migration Guide

Redis Cluster partitions the keyspace across 16,384 deterministic hash slots. Horizontal scaling requires redistributing these slots across newly provisioned or underutilized nodes without disrupting live traffic. Because the architecture relies on a decentralized gossip protocol and strict key hashing, any misalignment during the handoff triggers MOVED or ASK redirect storms that can exhaust connection pools and degrade application latency. Executing a Zero-Downtime Slot Migration requires treating the operation as a stateful, multi-phase transition rather than a bulk data transfer. The following guide details production-tested diagnostics, CLI orchestration, client-side resilience patterns, and CI/CD gating for Redis 6.2+ and 7.x environments.
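To make the key-to-slot mapping concrete, the sketch below computes a key's hash slot as CRC16(key) mod 16384, hashing only the hash tag when one is present. The function name key_slot is illustrative; the server-side equivalent is CLUSTER KEYSLOT.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key, honoring {hash tag} syntax."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end > start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 (XMODEM) variant Redis uses.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS
```

Keys that share a hash tag, such as {user1000}.following and {user1000}.followers, land in the same slot and therefore always migrate together.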

Pre-Migration Diagnostics and Baseline Validation

Before initiating any topology change, establish a deterministic health baseline. Uneven memory distribution, CPU saturation, or orphaned slots will amplify migration latency and increase the probability of partial failures.

1. Verify Slot Ownership and Topology Consistency

redis-cli -h <any-master> -p <port> --cluster check <any-master>:<port>
redis-cli -h <any-master> -p <port> CLUSTER SLOTS
redis-cli -h <any-master> -p <port> CLUSTER NODES

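Cross-reference the output to confirm that no slot is flagged [ERR], left unassigned, or claimed by more than one master. Each master normally carries its own configEpoch, so differing values across masters are expected; the real warning signs are nodes that disagree about a slot's owner, or slots left open in MIGRATING or IMPORTING state. CLUSTER SET-CONFIG-EPOCH cannot repair a running cluster, because it is only accepted on a fresh node whose node table is empty and whose epoch is zero. To repair open or uncovered slots, run:

redis-cli --cluster fix <any-master>:<port>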
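The cross-check can also be scripted. The following is a minimal sketch that compares raw CLUSTER NODES output collected from several nodes and reports slots that are uncovered, slots whose owner differs between views, and open MIGRATING/IMPORTING markers. The function names and the result layout are assumptions made for illustration, not part of any Redis tooling.

```python
from typing import Dict, List, Tuple

TOTAL_SLOTS = 16384


def parse_view(nodes_text: str) -> Tuple[Dict[int, str], List[str]]:
    """Parse one node's CLUSTER NODES output into (slot -> owner ID, open-slot markers)."""
    owners: Dict[int, str] = {}
    open_markers: List[str] = []
    for line in nodes_text.strip().splitlines():
        fields = line.split()
        # Field 3 holds the comma-separated flags; only masters own slots.
        if len(fields) < 8 or "master" not in fields[2].split(","):
            continue
        for token in fields[8:]:
            if token.startswith("["):
                # e.g. [93->-<node-id>] (migrating) or [93-<-<node-id>] (importing)
                open_markers.append(token)
                continue
            first, _, last = token.partition("-")
            for slot in range(int(first), int(last or first) + 1):
                owners[slot] = fields[0]
    return owners, open_markers


def audit(views: Dict[str, str]) -> Dict[str, list]:
    """Compare the views of several nodes, keyed by an arbitrary node label."""
    parsed = [parse_view(text) for text in views.values()]
    uncovered, disputed = [], []
    for slot in range(TOTAL_SLOTS):
        seen = {owners.get(slot) for owners, _ in parsed}
        if None in seen:
            uncovered.append(slot)      # missing in at least one view
        if len(seen - {None}) > 1:
            disputed.append(slot)       # views name different owners
    open_markers = sorted({m for _, markers in parsed for m in markers})
    return {"uncovered": uncovered, "disputed": disputed, "open": open_markers}
```

A healthy cluster should yield three empty lists; anything else is a reason to stop before resharding.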

2. Validate Critical Cluster Parameters

Ensure the following are explicitly configured in redis.conf or via CONFIG SET:

  • cluster-node-timeout 15000: prevents premature failover during high-latency MIGRATE operations
  • cluster-require-full-coverage no: allows the cluster to continue serving unaffected slots if a migration temporarily blocks a subset
  • cluster-migration-barrier 1: requires a master to retain at least one replica before any of its replicas may migrate to an orphaned master, so automatic replica migration never strips a master of its last replica while slots are moving
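The three settings above can be checked programmatically before each run. The sketch below compares a mapping of live settings against the expected values; the EXPECTED table and the config_drift name are illustrative, and the mapping is assumed to come from a call such as redis-py's config_get("cluster-*") on a client created with decode_responses=True.

```python
from typing import Dict, Mapping, Optional, Tuple

# Expected values taken from the list above, as the strings CONFIG GET returns.
EXPECTED = {
    "cluster-node-timeout": "15000",
    "cluster-require-full-coverage": "no",
    "cluster-migration-barrier": "1",
}


def config_drift(
    actual: Mapping[str, object],
    expected: Mapping[str, str] = EXPECTED,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {parameter: (expected, actual)} for every setting that differs."""
    drift = {}
    for name, want in expected.items():
        have = actual.get(name)
        have_text = None if have is None else str(have)
        if have_text != want:
            drift[name] = (want, have_text)
    return drift
```

An empty result means the node matches the baseline; running the check against every master catches nodes that were reconfigured at runtime but never had redis.conf updated.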

3. Identify Hot-Key Skew

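Run redis-cli --bigkeys to find oversized keys, or sample MEMORY USAGE on high-throughput keys; redis-cli --hotkeys reports access frequency but requires an LFU maxmemory-policy. MIGRATE blocks both nodes while it serializes and transfers a key, so large keys are the main cause of event-loop stalls. If a single slot holds more than 15% of a node's memory footprint, migrate it in its own run and lower --cluster-pipeline so that fewer keys move per MIGRATE call.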
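The 15% rule can be applied to sampled data with a few lines of code. This sketch assumes the samples are (slot, bytes) pairs that you have already gathered, for example by pairing each key's slot with its MEMORY USAGE result; heavy_slots and its parameters are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def heavy_slots(
    samples: Iterable[Tuple[int, int]],
    node_bytes: int,
    threshold: float = 0.15,
) -> Dict[int, float]:
    """Return {slot: share of node memory} for slots above the threshold.

    samples    -- (slot, size_in_bytes) pairs, one per sampled key
    node_bytes -- the node's total memory footprint, e.g. used_memory from INFO
    """
    per_slot: Dict[int, int] = defaultdict(int)
    for slot, size in samples:
        per_slot[slot] += size
    return {
        slot: total / node_bytes
        for slot, total in per_slot.items()
        if total / node_bytes > threshold
    }
```

Because the input is a sample, the shares are lower bounds; a slot that already crosses the threshold on partial data deserves its own migration run.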

Orchestrating the Slot Handoff

Slot migration is governed by three cluster states: MIGRATING (source), IMPORTING (target), and the post-handoff stable state. The redis-cli --cluster reshard utility automates the underlying CLUSTER SETSLOT and MIGRATE commands.

Incremental Migration Pattern

Never migrate more than 500 slots in a single execution for datasets exceeding 10 GB. Use the following sequence:

redis-cli --cluster reshard <target-host>:<target-port> \
  --cluster-from <source-node-id> \
  --cluster-to <target-node-id> \
  --cluster-slots 200 \
  --cluster-yes

During execution, the source node transitions to MIGRATING. Queries for keys in the target range that have already moved return ASK <slot> <target-host>:<target-port>. The client must follow the ASK redirect, issue an ASKING command, and retry the original operation. Once all keys are transferred, redis-cli --cluster reshard issues CLUSTER SETSLOT <slot> NODE <dest_node_id> to both the destination and source nodes to assign the slot to its new owner, and gossip propagates the updated ownership map.

CLUSTER SETSLOT <slot> STABLE only clears the transient MIGRATING/IMPORTING state; it does not transfer ownership and should only be used to abort a stalled migration.
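For readers who want to see what the reshard utility automates, here is a minimal sketch of the command sequence for a single slot. It assumes source and target are objects exposing execute_command(*args), as redis-py clients do; move_slot and its parameters are illustrative names, and error handling is deliberately omitted.

```python
def move_slot(source, target, slot, source_id, target_id,
              target_host, target_port, batch=10, timeout_ms=5000):
    """Move one slot from source to target and return the number of keys moved."""
    # 1. Target first: accept commands for this slot when preceded by ASKING.
    target.execute_command("CLUSTER", "SETSLOT", slot, "IMPORTING", source_id)
    # 2. Source: keep serving existing keys, answer ASK for keys already gone.
    source.execute_command("CLUSTER", "SETSLOT", slot, "MIGRATING", target_id)

    # 3. Drain the slot in small batches so each MIGRATE call stays short.
    moved = 0
    while True:
        keys = source.execute_command("CLUSTER", "GETKEYSINSLOT", slot, batch)
        if not keys:
            break
        # An empty key argument plus KEYS selects the multi-key form; DB is always 0.
        source.execute_command(
            "MIGRATE", target_host, target_port, "", 0, timeout_ms, "KEYS", *keys
        )
        moved += len(keys)

    # 4. Assign ownership on the target first, then on the source.
    for node in (target, source):
        node.execute_command("CLUSTER", "SETSLOT", slot, "NODE", target_id)
    return moved
```

In production, prefer redis-cli --cluster reshard, which adds the retries and consistency checks this sketch leaves out.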

sequenceDiagram
    participant C as Client
    participant Src as Source node
    participant Dst as Target node
    C->>Src: GET key (slot migrating)
    Src-->>C: ASK slot dst
    C->>Dst: ASKING
    C->>Dst: GET key
    Dst-->>C: value

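When ACLs are enforced, pass AUTH2 <username> <password> to MIGRATE (available since Redis 6.0). MIGRATE returns OK on success, NOKEY when none of the listed keys exist on the source, and an IOERR error on a timeout or I/O failure; check these replies alongside redis-cli --cluster check output to detect network timeouts or key serialization errors.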
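A custom migration script could assemble the MIGRATE arguments and interpret the reply along these lines. Both helpers are hypothetical, and the outcome labels are chosen for this sketch rather than taken from Redis.

```python
from typing import List, Optional, Sequence


def migrate_args(host: str, port: int, keys: Sequence[str], timeout_ms: int = 5000,
                 username: Optional[str] = None,
                 password: Optional[str] = None) -> List[object]:
    """Build the argument list for a multi-key MIGRATE into database 0."""
    args: List[object] = ["MIGRATE", host, port, "", 0, timeout_ms]
    if username and password:
        args += ["AUTH2", username, password]   # ACL user and password
    elif password:
        args += ["AUTH", password]              # legacy requirepass
    return args + ["KEYS", *keys]


def classify_reply(reply) -> str:
    """Map a MIGRATE reply, or the text of a raised error, to a coarse outcome."""
    text = reply.decode() if isinstance(reply, bytes) else str(reply)
    if text == "OK":
        return "moved"
    if text == "NOKEY":
        return "nothing-to-move"
    if text.startswith("IOERR"):
        # Timeout or I/O failure: the keys may exist on one or both nodes.
        return "io-error"
    return "error"
```

With redis-py, the argument list would be passed as client.execute_command(*migrate_args(...)); an IOERR arrives as a raised ResponseError whose message can be handed to classify_reply.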

Client-Side Resilience and Python Retry Patterns

Native cluster clients handle MOVED redirects automatically, but ASK redirects also require the ASKING prefix before the retried command. redis-py handles both automatically via RedisCluster. The code below illustrates the underlying mechanics for custom routing layers:

import redis
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


class RedisClusterMigrationClient:
    def __init__(self, nodes):
        # nodes: list of redis.cluster.ClusterNode seed entries
        self.client = redis.RedisCluster(startup_nodes=nodes, decode_responses=True)

    @retry(
        stop=stop_after_attempt(5),
        wait=wait_exponential(multiplier=0.1, min=0.1, max=2),
        # Retries cover connection failures and an ASK raised while following a redirect.
        retry=retry_if_exception_type((redis.exceptions.AskError, redis.exceptions.ConnectionError)),
        reraise=True,
    )
    def safe_set(self, key: str, value: str) -> bool:
        try:
            return self.client.set(key, value)
        except redis.exceptions.AskError as e:
            # redis-py parses the ASK reply and exposes the target as e.host and e.port.
            ask_conn = redis.Redis(host=e.host, port=e.port, decode_responses=True)
            try:
                # ASKING only affects the next command on the same connection,
                # so send both through one non-transactional pipeline.
                pipe = ask_conn.pipeline(transaction=False)
                pipe.execute_command("ASKING")
                pipe.set(key, value)
                return pipe.execute()[-1]
            finally:
                ask_conn.close()

For high-throughput services, configure redis-py with retry_on_timeout=True and socket_timeout=2.0 to prevent thread pool starvation during gossip propagation delays. Refer to the official redis-py documentation for cluster client initialization best practices.

CI/CD Gating and Pipeline Automation

Automate migration safety checks to prevent topology corruption during automated deployments.

jobs:
  redis-slot-migration:
    runs-on: ubuntu-latest
    steps:
      - name: Pre-flight Cluster Health Check
        run: |
          # REDIS_ENDPOINT is host:port; split it for the -h/-p form below.
          redis-cli --cluster check "$REDIS_ENDPOINT" | grep -q "All 16384 slots covered" || exit 1
          redis-cli -h "${REDIS_ENDPOINT%:*}" -p "${REDIS_ENDPOINT##*:}" CLUSTER INFO | grep -q "cluster_state:ok" || exit 1

      - name: Canary Slot Redistribution (50 slots)
        run: |
          redis-cli --cluster reshard "$TARGET_NODE" \
            --cluster-from "$SOURCE_ID" \
            --cluster-to "$TARGET_ID" \
            --cluster-slots 50 \
            --cluster-yes \
            --cluster-timeout 10000

      - name: Post-Migration Validation Gate
        run: |
          sleep 30  # Allow gossip convergence
          redis-cli --cluster check "$REDIS_ENDPOINT" > cluster_report.txt
          grep -q "\[OK\] All 16384 slots covered" cluster_report.txt || {
            echo "FAIL: slot coverage gap"; exit 1;
          }
          if grep "\[ERR\]" cluster_report.txt; then
            echo "FAIL: cluster check reported errors"; exit 1
          fi
          echo "PASS: Migration complete"

Enforce mandatory manual approval gates for migrations exceeding 1,000 slots. Integrate Prometheus redis_cluster_slots_assigned and redis_cluster_slots_ok metrics into your deployment dashboard to trigger automated rollbacks if slot coverage drops below 100% for more than 60 seconds.
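The rollback condition amounts to "coverage has stayed below 100% for more than 60 consecutive seconds". The sketch below expresses that logic over a series of (timestamp, slots_ok) samples, however they are scraped; should_roll_back is an illustrative name, and in a Prometheus setup the same condition would normally live in an alerting rule rather than in application code.

```python
from typing import Iterable, Optional, Tuple


def should_roll_back(
    samples: Iterable[Tuple[float, int]],
    total_slots: int = 16384,
    grace_seconds: float = 60.0,
) -> bool:
    """Return True once slot coverage stays degraded for longer than the grace period.

    samples -- (unix_timestamp, slots_ok) pairs in chronological order
    """
    degraded_since: Optional[float] = None
    for timestamp, slots_ok in samples:
        if slots_ok >= total_slots:
            degraded_since = None          # recovered: reset the clock
            continue
        if degraded_since is None:
            degraded_since = timestamp     # first degraded sample
        if timestamp - degraded_since > grace_seconds:
            return True
    return False
```

Resetting the clock on every healthy sample keeps brief dips during gossip propagation from triggering a rollback.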

Post-Migration Verification and Telemetry

After the final slot batch is committed, validate topology consistency and monitor convergence metrics:

redis-cli --cluster check <any-node>:<port>
redis-cli -h <any-node> -p <port> CLUSTER NODES | awk '$3 ~ /master/ {out = $2; for (i = 9; i <= NF; i++) out = out " " $i; print out}' | sort

Confirm that the growth rate of the cluster_stats_messages_received and cluster_stats_messages_sent counters in CLUSTER INFO settles back to its pre-migration baseline within 5–10 seconds, indicating gossip convergence. Watch cluster_slots_pfail and cluster_slots_fail in CLUSTER INFO and re-run redis-cli --cluster check to surface persistent network bottlenecks or stalled slots. For teams scaling beyond single-region deployments, review the broader architectural patterns documented in Redis Cluster Scaling, Sharding & Automation to align slot distribution with cross-AZ latency requirements.
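Since the message counters only ever increase, one way to judge convergence is to compare two CLUSTER INFO samples taken a few seconds apart. The helpers below are a minimal sketch that works on the raw text reply; the function names are assumptions for illustration.

```python
def parse_cluster_info(text: str) -> dict:
    """Turn the key:value lines of a CLUSTER INFO reply into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = int(value) if value.isdigit() else value
    return info


def slots_healthy(info: dict, total_slots: int = 16384) -> bool:
    """True when every slot is assigned and OK and none is failing."""
    return (
        info.get("cluster_state") == "ok"
        and info.get("cluster_slots_assigned") == total_slots
        and info.get("cluster_slots_ok") == total_slots
        and info.get("cluster_slots_pfail") == 0
        and info.get("cluster_slots_fail") == 0
    )


def message_rate(before: dict, after: dict, seconds: float) -> float:
    """Cluster bus messages per second (sent plus received) between two samples."""
    sent = after["cluster_stats_messages_sent"] - before["cluster_stats_messages_sent"]
    received = (after["cluster_stats_messages_received"]
                - before["cluster_stats_messages_received"])
    return (sent + received) / seconds
```

Comparing message_rate against a value recorded before the migration gives a concrete baseline to converge toward.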

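Maintain a 24-hour observation window before decommissioning legacy nodes. Monitor connection pool utilization, the latency_percentiles_usec_* fields from INFO latencystats (Redis 7.0+), latency spikes recorded by LATENCY HISTORY, and replica sync lag (the gap between the master's master_repl_offset and each replica's acknowledged offset in INFO replication) to ensure the cluster has fully absorbed the new topology.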
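Replica lag can be read from a master's INFO replication section. This sketch assumes the section has already been parsed into a dict in which each slaveN entry is itself a dict with ip, port, and offset fields, which is the shape redis-py's info("replication") is expected to return; replica_lag_bytes is a hypothetical helper.

```python
from typing import Dict


def replica_lag_bytes(replication_info: dict) -> Dict[str, int]:
    """Return {replica address: bytes behind the master's replication offset}."""
    master_offset = int(replication_info["master_repl_offset"])
    lags: Dict[str, int] = {}
    for name, value in replication_info.items():
        # Replica entries are named slave0, slave1, ... and hold nested fields.
        if name.startswith("slave") and isinstance(value, dict):
            address = f"{value['ip']}:{value['port']}"
            lags[address] = master_offset - int(value["offset"])
    return lags
```

A lag that keeps growing across samples, rather than hovering near zero, suggests the replica has not caught up with its master's new slot set.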