Build Up
What
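FailOver
Failover means handing a failed node's role over to a standby node so that the service keeps running after a failure.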
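AutoFailOver
Auto failover means the system recovers automatically when a failure occurs, without an operator stepping in. In Redis Cluster this is done by promoting a replica to replace the failed master.
Enabling Redis cluster mode does not by itself give you auto failover: a master with no replica has nothing to promote, so this has to be handled explicitly.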
Why
- The service runs 24 hours a day, so an outage inconveniences users and raises the churn rate, and a developer responding by hand takes time.
- Data can also be lost while the developer is responding.
How
Inducing a failure in a 3-node Redis cluster
- Deliberately crash the node on port 7002
    $ docker exec -it redis-master-2 bash
    $ redis-cli -c -p 7002
    127.0.0.1:7002> debug segfault
    Error: Connection reset by peer
- Check the log of the 7002 node
    $ docker logs -f redis-master-2
    * FAIL message received from 3c349984f0bb61490c170ab68f2617a35d9581d6 about 79816979a6dd4b226e476121dd385ed6c25e5151
    # Cluster state changed: fail
- Check the cluster information from 7001
    $ docker exec -it redis-master-1 bash
    $ redis-cli -c -p 7001
    127.0.0.1:7001> set a b
    (error) CLUSTERDOWN The cluster is down
    127.0.0.1:7001> cluster info
    cluster_state:fail
    cluster_slots_assigned:16384
    cluster_slots_ok:10922
    cluster_slots_pfail:0
    cluster_slots_fail:5462
    cluster_known_nodes:3
    cluster_size:3
    cluster_current_epoch:3
    cluster_my_epoch:1
    cluster_stats_messages_ping_sent:435
    cluster_stats_messages_pong_sent:461
    cluster_stats_messages_sent:896
    cluster_stats_messages_ping_received:459
    cluster_stats_messages_pong_received:434
    cluster_stats_messages_meet_received:2
    cluster_stats_messages_fail_received:1
    cluster_stats_messages_received:896
    127.0.0.1:7001> cluster nodes
    027a002ecc012b61a5997f151ad01bccbb65d1c0 127.0.0.1:7001@17001 myself,master - 0 1624260599000 1 connected 0-5460
    3c349984f0bb61490c170ab68f2617a35d9581d6 127.0.0.1:7003@17003 master - 0 1624260599580 3 connected 10923-16383
    79816979a6dd4b226e476121dd385ed6c25e5151 127.0.0.1:7002@17002 master,fail - 1624260495388 1624260493352 2 disconnected 5461-10922
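127.0.0.1:7001> set a b
- The write is rejected with CLUSTERDOWN. With 7002 down and no replica to replace it, slots 5461-10922 are not served by any node, and by default (cluster-require-full-coverage yes) the cluster stops accepting commands as soon as any slot is uncovered.
cluster_state:fail
- The cluster state is fail, so the cluster cannot be used.
127.0.0.1:7001> cluster nodes
- The node listing shows that the 7002 master is disconnected.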
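The `cluster info` figures above can be cross-checked with a short script. The following is a minimal Python sketch, not part of the original walkthrough; `parse_cluster_info` and `uncovered_slots` are hypothetical helper names, and the sample text is copied from the transcript. It shows that the 5462 slots reported as failed are exactly the slots no longer being served.

```python
def parse_cluster_info(text: str) -> dict:
    """Turn CLUSTER INFO output (one key:value pair per line) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            # Numeric fields become ints, everything else stays a string.
            info[key] = int(value) if value.isdigit() else value
    return info


def uncovered_slots(info: dict) -> int:
    """Number of assigned hash slots that are not currently served."""
    return info["cluster_slots_assigned"] - info["cluster_slots_ok"]


# Values copied from the transcript above.
sample = """cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:10922
cluster_slots_pfail:0
cluster_slots_fail:5462
cluster_known_nodes:3
cluster_size:3"""

info = parse_cluster_info(sample)
print(info["cluster_state"], uncovered_slots(info))  # fail 5462
```

The range 5461-10922 owned by 7002 contains 10922 - 5461 + 1 = 5462 slots, which matches `cluster_slots_fail`.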
- Restart the 7002 node.
    $ docker restart redis-master-2
    redis-master-2
    $ docker ps | grep redis-master-2
    de5e52fb0428 redis:6.2.3 "docker-entrypoint.s…" 9 minutes ago Up 8 seconds redis-master-2
- Go back to 7001 and check the cluster information
    $ docker exec -it redis-master-1 bash
    $ redis-cli -c -p 7001
    127.0.0.1:7001> cluster nodes
    027a002ecc012b61a5997f151ad01bccbb65d1c0 127.0.0.1:7001@17001 myself,master - 0 1624260765000 1 connected 0-5460
    3c349984f0bb61490c170ab68f2617a35d9581d6 127.0.0.1:7003@17003 master - 0 1624260767224 3 connected 10923-16383
    79816979a6dd4b226e476121dd385ed6c25e5151 127.0.0.1:7002@17002 master - 0 1624260766218 2 connected 5461-10922
- Confirm that the cluster works normally
    127.0.0.1:7001> set a b
    -> Redirected to slot [15495] located at 127.0.0.1:7003
    OK
    127.0.0.1:7003> get a
    "b"
    127.0.0.1:7003> set b c
    -> Redirected to slot [3300] located at 127.0.0.1:7001
    OK
    127.0.0.1:7001> set c d
    -> Redirected to slot [7365] located at 127.0.0.1:7002
    OK
    127.0.0.1:7002> get c
    "d"
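The slot numbers in the redirects above are not arbitrary: Redis Cluster maps each key to one of 16384 hash slots using CRC16 (the XMODEM variant) modulo 16384, hashing only the hash tag between `{` and `}` when a non-empty one is present. The following Python sketch reproduces that calculation; `key_hash_slot` is a hypothetical helper name, and it relies on `binascii.crc_hqx` with an initial value of 0, which computes the same CRC.

```python
import binascii


def key_hash_slot(key: str) -> int:
    """Hash slot (0..16383) that Redis Cluster assigns to a key."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # a closing brace exists and the tag is non-empty
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


for key in ("a", "b", "c"):
    print(key, key_hash_slot(key))
# a 15495
# b 3300
# c 7365
```

These match the transcript: slot 15495 falls in 10923-16383 (7003), slot 3300 in 0-5460 (7001), and slot 7365 in 5461-10922 (7002).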
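Inducing a failure in a 6-node cluster
Expected scenario
When the m1 node dies, its slave node should automatically be promoted to take over m1's role, so that Redis itself does not become a single point of failure.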

- Deliberately stop a node
    $ docker stop redis-master-2
- Check the slave log
    $ docker logs -f redis-slave-2
    # Connection with master lost.
    * Caching the disconnected master state.
    * Reconnecting to MASTER 192.168.56.101:7001
    * MASTER <-> REPLICA sync started
    # Error condition on socket for SYNC: Connection refused
    ...
    # Failover election won: I'm the new master.
    # Cluster state changed: ok
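The log shows the replica winning the election some time after it lost its master. One way to measure how long the cluster stays unavailable is to poll `cluster info` until `cluster_state` returns to `ok`. The following is a minimal Python sketch under stated assumptions: `wait_for_cluster_ok` and `fetch_from_redis_cli` are hypothetical helpers, port 7001 is taken from the commands in this walkthrough, and the demo at the bottom uses canned replies instead of a live cluster.

```python
import subprocess
import time


def wait_for_cluster_ok(fetch_info, timeout=30.0, interval=0.5,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll until CLUSTER INFO reports cluster_state:ok.

    Returns the elapsed seconds, or None if the timeout expires first.
    """
    start = clock()
    while clock() - start < timeout:
        try:
            if "cluster_state:ok" in fetch_info():
                return clock() - start
        except OSError:
            pass  # redis-cli missing or not reachable yet; try again
        sleep(interval)
    return None


def fetch_from_redis_cli(port=7001):
    """Assumed fetcher: runs `redis-cli -p <port> cluster info` locally."""
    result = subprocess.run(
        ["redis-cli", "-p", str(port), "cluster", "info"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


# Simulated replies so the sketch runs without a cluster.
replies = iter(["cluster_state:fail", "cluster_state:fail", "cluster_state:ok"])
elapsed = wait_for_cluster_ok(lambda: next(replies), interval=0.0)
print("recovered:", elapsed is not None)
```

Against a real cluster, calling `wait_for_cluster_ok(fetch_from_redis_cli)` right after `docker stop` would give a rough figure for the failover window.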
- Check the cluster information
    $ docker exec -it redis-master-1 bash
    $ redis-cli -c -p 7001
    127.0.0.1:7001> cluster nodes
    30a99d668af3ddda16e2a9d3ee97fb53a5ebfa6d 192.168.56.100:7002@17002 myself,slave 22110f4ea10f11a8cb6ea283dedfc27c6ffabc07 0 1624434594000 3 connected
    22110f4ea10f11a8cb6ea283dedfc27c6ffabc07 192.168.56.102:7001@17001 master - 0 1624434596533 3 connected 10923-16383
    094af2ab1db0d147d7f475f3954429ae7d18dee0 192.168.56.102:7002@17002 master - 0 1624434595524 7 connected 5461-10922
    c952f5ef4783b5c19129bc630b88e8e3bf602622 192.168.56.101:7001@17001 master,fail - 1624434574580 1624434573000 2 disconnected
    5b56d458a0d8e64d5f40ece0a99713dcb9c70723 192.168.56.100:7001@17001 master - 0 1624434595524 1 connected 0-5460
    e0d9ee09b593889cd093d217a16a0b535e6abef2 192.168.56.101:7002@17002 slave 5b56d458a0d8e64d5f40ece0a99713dcb9c70723 0 1624434595626 1 connected
094af2ab1db0d147d7f475f3954429ae7d18dee0 192.168.56.102:7002@17002 master
- 192.168.56.102:7002, which used to be a slave, has been promoted to master and now owns slots 5461-10922.
c952f5ef4783b5c19129bc630b88e8e3bf602622 192.168.56.101:7001@17001 master,fail
- The master node that failed is in the fail state, is disconnected, and no longer owns any slots.
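Reading the `cluster nodes` listing by eye gets harder as the cluster grows. The following Python sketch, not part of the original walkthrough, parses the listing and reports which nodes are flagged as failed and which slot-owning masters currently have no healthy replica; `parse_cluster_nodes` and `unprotected_masters` are hypothetical helper names, and the sample is the output shown above.

```python
def parse_cluster_nodes(text: str) -> list:
    """Parse CLUSTER NODES output into one dict per node."""
    nodes = []
    for line in text.strip().splitlines():
        f = line.split()
        if len(f) < 8:
            continue  # not a complete node line
        nodes.append({
            "id": f[0],
            "addr": f[1].split("@")[0],          # drop the cluster bus port
            "flags": set(f[2].split(",")),
            "master_id": None if f[3] == "-" else f[3],
            "link": f[7],
            "slots": f[8:],                      # empty for replicas
        })
    return nodes


def unprotected_masters(nodes: list) -> list:
    """Addresses of slot-owning masters that have no healthy replica."""
    covered = {
        n["master_id"] for n in nodes
        if "slave" in n["flags"] and "fail" not in n["flags"]
        and n["link"] == "connected"
    }
    return [n["addr"] for n in nodes
            if "master" in n["flags"] and n["slots"] and n["id"] not in covered]


sample = """
30a99d668af3ddda16e2a9d3ee97fb53a5ebfa6d 192.168.56.100:7002@17002 myself,slave 22110f4ea10f11a8cb6ea283dedfc27c6ffabc07 0 1624434594000 3 connected
22110f4ea10f11a8cb6ea283dedfc27c6ffabc07 192.168.56.102:7001@17001 master - 0 1624434596533 3 connected 10923-16383
094af2ab1db0d147d7f475f3954429ae7d18dee0 192.168.56.102:7002@17002 master - 0 1624434595524 7 connected 5461-10922
c952f5ef4783b5c19129bc630b88e8e3bf602622 192.168.56.101:7001@17001 master,fail - 1624434574580 1624434573000 2 disconnected
5b56d458a0d8e64d5f40ece0a99713dcb9c70723 192.168.56.100:7001@17001 master - 0 1624434595524 1 connected 0-5460
e0d9ee09b593889cd093d217a16a0b535e6abef2 192.168.56.101:7002@17002 slave 5b56d458a0d8e64d5f40ece0a99713dcb9c70723 0 1624434595626 1 connected
"""

nodes = parse_cluster_nodes(sample)
failed = [n["addr"] for n in nodes if "fail" in n["flags"]]
print(failed)                      # ['192.168.56.101:7001']
print(unprotected_masters(nodes))  # ['192.168.56.102:7002']
```

The second result is worth noting: right after the failover, the newly promoted master has no replica of its own, so a second failure on that shard would not be recoverable until the stopped node rejoins.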
- Bring the stopped node back up
    $ docker restart redis-master-2
- Check the Redis cluster information again
    $ docker exec -it redis-master-1 bash
    $ redis-cli -c -p 7001
    127.0.0.1:7001> cluster nodes
    30a99d668af3ddda16e2a9d3ee97fb53a5ebfa6d 192.168.56.100:7002@17002 myself,slave 22110f4ea10f11a8cb6ea283dedfc27c6ffabc07 0 1624434646000 3 connected
    22110f4ea10f11a8cb6ea283dedfc27c6ffabc07 192.168.56.102:7001@17001 master - 0 1624434646576 3 connected 10923-16383
    094af2ab1db0d147d7f475f3954429ae7d18dee0 192.168.56.102:7002@17002 master - 0 1624434646071 7 connected 5461-10922
    c952f5ef4783b5c19129bc630b88e8e3bf602622 192.168.56.101:7001@17001 slave 094af2ab1db0d147d7f475f3954429ae7d18dee0 0 1624434647079 7 connected
    5b56d458a0d8e64d5f40ece0a99713dcb9c70723 192.168.56.100:7001@17001 master - 0 1624434646575 1 connected 0-5460
    e0d9ee09b593889cd093d217a16a0b535e6abef2 192.168.56.101:7002@17002 slave 5b56d458a0d8e64d5f40ece0a99713dcb9c70723 0 1624434646576 1 connected
Redis layout in docker compose
    redis-node1:
      platform: linux/x86_64 # for macOS on Apple M1
      image: redis:6.2
      container_name: redis-node1
      volumes: # share the prepared config file with the container through a volume
        - ./redis-cluster/redis.conf:/usr/local/etc/redis/redis.conf
      command: redis-server /usr/local/etc/redis/redis.conf
      ports:
        - "6380:6380"
        - "6381:6381"
        - "6379:6379"
        - "6382:6382"
        - "6383:6383"
        - "6384:6384"
    redis-node2:
      network_mode: "service:redis-node1"
      platform: linux/x86_64
      image: redis:6.2
      container_name: redis-node2
      volumes:
        - ./redis-cluster/redis1.conf:/usr/local/etc/redis/redis.conf
      command: redis-server /usr/local/etc/redis/redis.conf
    redis-node3:
      network_mode: "service:redis-node1"
      platform: linux/x86_64
      image: redis:6.2
      container_name: redis-node3
      volumes:
        - ./redis-cluster/redis2.conf:/usr/local/etc/redis/redis.conf
      command: redis-server /usr/local/etc/redis/redis.conf
    redis-slave1:
      network_mode: "service:redis-node1"
      platform: linux/x86_64
      image: redis:6.2
      container_name: redis-slave1
      volumes:
        - ./redis-cluster/redis-slave1.conf:/usr/local/etc/redis/redis.conf
      command: redis-server /usr/local/etc/redis/redis.conf
    redis-slave2:
      network_mode: "service:redis-node1"
      platform: linux/x86_64
      image: redis:6.2
      container_name: redis-slave2
      volumes:
        - ./redis-cluster/redis-slave2.conf:/usr/local/etc/redis/redis.conf
      command: redis-server /usr/local/etc/redis/redis.conf
    redis-slave3:
      network_mode: "service:redis-node1"
      platform: linux/x86_64
      image: redis:6.2
      container_name: redis-slave3
      volumes:
        - ./redis-cluster/redis-slave3.conf:/usr/local/etc/redis/redis.conf
      command: redis-server /usr/local/etc/redis/redis.conf
    redis-cluster-entry:
      network_mode: "service:redis-node1"
      platform: linux/x86_64
      image: redis:6.2
      container_name: redis-cluster-entry
      command: redis-cli --cluster create 127.0.0.1:6379 127.0.0.1:6380 127.0.0.1:6381 127.0.0.1:6382 127.0.0.1:6383 127.0.0.1:6384 --cluster-yes --cluster-replicas 1
      depends_on:
        - redis-node1
        - redis-node2
        - redis-node3
      restart: on-failure
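The compose file mounts a separate config file into each container, but the contents of those files are not shown here. Because every container shares the network namespace of redis-node1, each node has to listen on its own port. The following Python sketch prints what a minimal per-node config might look like. It is an assumption, not the author's actual files: the mapping of file names to ports and the 5000 ms node timeout are hypothetical, while `port`, `cluster-enabled`, `cluster-config-file` and `cluster-node-timeout` are standard Redis directives.

```python
def render_node_conf(port: int, node_timeout_ms: int = 5000) -> str:
    """Render a minimal redis.conf for one cluster node (illustrative only)."""
    lines = [
        f"port {port}",                             # unique per node: shared network namespace
        "cluster-enabled yes",                      # run this instance in cluster mode
        "cluster-config-file nodes.conf",           # node state file maintained by Redis itself
        f"cluster-node-timeout {node_timeout_ms}",  # ms of unreachability before a node is suspected
    ]
    return "\n".join(lines) + "\n"


# Hypothetical file-to-port mapping; the source does not state which file uses which port.
CONF_PORTS = {
    "redis.conf": 6379,
    "redis1.conf": 6380,
    "redis2.conf": 6381,
    "redis-slave1.conf": 6382,
    "redis-slave2.conf": 6383,
    "redis-slave3.conf": 6384,
}

for name, port in CONF_PORTS.items():
    print(f"# {name}")
    print(render_node_conf(port))
```

The node timeout matters for the failover test above: a replica only starts an election after its master has been unreachable for longer than this value, so a smaller timeout shortens the outage at the cost of more false positives on a flaky network.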