ちゃんるいす's Blog

Miscellaneous notes from an otaku engineer

When an InnoDB Cluster Dies from Split-Brain


 MySQL  db03:33060+ ssl  JS > c.status()
{
    "clusterName": "main",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "db03.luis.local:3306",
        "ssl": "REQUIRED",
        "status": "NO_QUORUM",
        "statusText": "Cluster has no quorum as visible from 'db03.luis.local:3306' and cannot process write transactions. 2 members are not active",
        "topology": {
            "db01.luis.local:3306": {
                "address": "db01.luis.local:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            },
            "db02.luis.local:3306": {
                "address": "db02.luis.local:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "UNREACHABLE",
                "version": "8.0.21"
            },
            "db03.luis.local:3306": {
                "address": "db03.luis.local:3306",
                "mode": "R/O",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.21"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "db03.luis.local:3306"
}
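
As a rough aid for reading the output above, here is a minimal sketch that counts the ONLINE members and compares the count against a strict majority. `summarizeStatus` is a hypothetical helper, not part of AdminAPI, and it is written against a plain object shaped like the `c.status()` output. Counting only ONLINE members is a simplification; the `status` field reported by the cluster remains the authoritative answer.

```javascript
// Hypothetical helper (not an AdminAPI call): summarize a status object
// shaped like the c.status() output shown above.
function summarizeStatus(status) {
  const topology = status.defaultReplicaSet.topology;
  const members = Object.keys(topology);
  // Simplification: only members reported as ONLINE are counted.
  const online = members.filter((addr) => topology[addr].status === "ONLINE");
  // A strict majority of N members is floor(N / 2) + 1.
  const majorityNeeded = Math.floor(members.length / 2) + 1;
  return {
    total: members.length,
    online: online,
    majorityNeeded: majorityNeeded,
    hasMajority: online.length >= majorityNeeded,
  };
}
```

For the three-node output above, this reports 1 ONLINE member out of 3 with 2 needed, which matches the NO_QUORUM status.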


Whether I try rejoin or anything else, it fails with an error because there is no valid quorum.

 MySQL  db03:33060+ ssl  JS > c.rejoinInstance('db03.luis.local:3306')
Cluster.rejoinInstance: There is no quorum to perform the operation (RuntimeError)

One node is still healthy (db03.luis.local:3306), so I rebuild the Cluster from that node with forceQuorumUsingPartitionOf.

 MySQL  db03:33060+ ssl  JS > c.forceQuorumUsingPartitionOf('root@db03.luis.local:3306', 'password')
Restoring cluster 'main' from loss of quorum, by using the partition composed of [db03.luis.local:3306]

Restoring the InnoDB cluster ...

The InnoDB cluster was successfully restored using the partition from the instance 'root@db03.luis.local:3306'.

WARNING: To avoid a split-brain scenario, ensure that all other members of the cluster are removed or joined back to the group that was restored.

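This occasionally fails, but it goes through after waiting a few minutes (maybe status: UNREACHABLE is the culprit?).
Perhaps it passed after a few minutes because connectivity to db02.luis.local:3306 came back? When it fails, it looks like this: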

 MySQL  db03:33060+ ssl  JS > c.forceQuorumUsingPartitionOf('root@db03.luis.local:3306', 'password')
Restoring cluster 'main' from loss of quorum, by using the partition composed of [db03.luis.local:3306]

Restoring the InnoDB cluster ...

Cluster.forceQuorumUsingPartitionOf: db03.luis.local:3306: Variable 'group_replication_force_members' can't be set to the value of 'db03.luis.local:33061' (RuntimeError)
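
Since waiting and trying again was what worked here, the retry could be sketched as a small loop. `retry` is a hypothetical helper; the `attempt` and `wait` callbacks are supplied by the caller, so nothing in the block depends on a specific shell API. It only automates the manual "wait a few minutes and rerun" step and does not address why the first attempt failed.

```javascript
// Hypothetical helper: call attempt() up to maxTries times, invoking
// wait(waitSeconds) between failed tries. Rethrows the last error if
// every try fails.
function retry(attempt, maxTries, waitSeconds, wait) {
  let lastError = null;
  for (let i = 1; i <= maxTries; i++) {
    try {
      return attempt();
    } catch (err) {
      lastError = err;
      // No point waiting after the final failed try.
      if (i < maxTries) wait(waitSeconds);
    }
  }
  throw lastError;
}
```

In MySQL Shell, `attempt` would wrap the `c.forceQuorumUsingPartitionOf(...)` call shown above, and `wait` could be a sleep function such as `os.sleep`, assuming your Shell version provides it.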

Once forceQuorumUsingPartitionOf succeeds, NO_QUORUM is resolved.

 MySQL  db03:33060+ ssl  JS > c.status()
{
    "clusterName": "main",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "db03.luis.local:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active",
        "topology": {
            "db01.luis.local:3306": {
                "address": "db01.luis.local:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            },
            "db02.luis.local:3306": {
                "address": "db02.luis.local:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            },
            "db03.luis.local:3306": {
                "address": "db03.luis.local:3306",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.21"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "db03.luis.local:3306"
}

All that's left is to rejoin the MISSING nodes.

 MySQL  db03:33060+ ssl  JS > c.rejoinInstance('root@db01.luis.local')
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Rejoining instance to the cluster ...

The instance 'db01.luis.local' was successfully rejoined on the cluster.

 MySQL  db03:33060+ ssl  JS > c.rejoinInstance('root@db02.luis.local')
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Rejoining instance to the cluster ...

The instance 'db02.luis.local' was successfully rejoined on the cluster.
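
With more than a couple of nodes, the list of instances to rejoin could be pulled out of the status output instead of typed by hand. A minimal sketch follows; `membersWithStatus` is a hypothetical helper written against a plain object shaped like the `c.status()` output, and the commented usage assumes `c` is the cluster handle from the transcripts.

```javascript
// Hypothetical helper: return the addresses of all members whose
// reported status equals `wanted`, e.g. "(MISSING)".
function membersWithStatus(status, wanted) {
  const topology = status.defaultReplicaSet.topology;
  return Object.keys(topology).filter(
    (addr) => topology[addr].status === wanted
  );
}

// Possible use inside MySQL Shell (assumption: `c` is the cluster handle
// and the root account is used, as in the transcripts):
// membersWithStatus(c.status(), "(MISSING)")
//   .forEach((addr) => c.rejoinInstance("root@" + addr));
```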

 MySQL  db03:33060+ ssl  JS > c.status()
{
    "clusterName": "main",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "db03.luis.local:3306",
        "ssl": "REQUIRED",
        "status": "OK",
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
        "topology": {
            "db01.luis.local:3306": {
                "address": "db01.luis.local:3306",
                "mode": "R/O",
                "readReplicas": {},
                "recovery": {
                    "state": "ON"
                },
                "recoveryStatusText": "Distributed recovery in progress",
                "role": "HA",
                "status": "RECOVERING",
                "version": "8.0.21"
            },
            "db02.luis.local:3306": {
                "address": "db02.luis.local:3306",
                "mode": "R/O",
                "readReplicas": {},
                "recovery": {
                    "cloneStartTime": "2020-04-11 12:37:09.240",
                    "cloneState": "Completed",
                    "currentStage": "RECOVERY",
                    "currentStageState": "Completed"
                },
                "recoveryStatusText": "Cloning in progress",
                "role": "HA",
                "status": "RECOVERING",
                "version": "8.0.21"
            },
            "db03.luis.local:3306": {
                "address": "db03.luis.local:3306",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.21"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "db03.luis.local:3306"
}
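
In the last output db01 and db02 are still RECOVERING, so the cluster is not fully back until both report ONLINE. A minimal polling sketch is below. `waitUntilAllOnline` is a hypothetical helper; `getStatus` and `wait` are caller-supplied callbacks (for example `() => c.status()` and a sleep function), which is an assumption and not something shown in the transcripts.

```javascript
// Hypothetical helper: poll getStatus() up to maxPolls times and return
// true as soon as every member reports ONLINE, false if that never happens.
function waitUntilAllOnline(getStatus, maxPolls, wait) {
  for (let i = 0; i < maxPolls; i++) {
    const topology = getStatus().defaultReplicaSet.topology;
    const pending = Object.keys(topology).filter(
      (addr) => topology[addr].status !== "ONLINE"
    );
    if (pending.length === 0) return true;
    // Pause between polls, but not after the final one.
    if (i < maxPolls - 1) wait();
  }
  return false;
}
```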