This document records the fix for a cluster failure caused by a host IP change. Two clusters were involved: a single-node (all-in-one) cluster and a four-node cluster (3 masters, 1 worker node).

1. Update Etcd certificate

  • Back up the Etcd certificates on each Etcd node.

    cp -R /etc/ssl/etcd/ssl /etc/ssl/etcd/ssl-bak
    
  • View the domains and IPs in the existing Etcd certificate

    openssl x509 -in /etc/ssl/etcd/ssl/node-node1.pem -noout -text|grep DNS
    
                    DNS:etcd, DNS:etcd.kube-system, DNS:etcd.kube-system.svc, DNS:etcd.kube-system.svc.cluster.local, DNS:localhost, DNS:node1, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1, IP Address:x.x.x.1
    

    All DNS and IP values need to be recorded and used to generate new certificates.
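
    To record them for every node in one pass, the backed-up certificates can be looped over (a minimal sketch, using the ssl-bak copy made in the first step):

    # print the SANs of every backed-up node certificate (skip the key files)
    for f in $(ls /etc/ssl/etcd/ssl-bak/node-*.pem | grep -v -- '-key'); do
        echo "== ${f} =="
        openssl x509 -in "${f}" -noout -text | grep DNS
    done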

  • Clean up old Etcd certificates at each Etcd node

    rm -f /etc/ssl/etcd/ssl/*
    
  • Generate the Etcd certificate configuration on one Etcd node.

    vim /etc/ssl/etcd/ssl/openssl.conf
    
    [req]
    req_extensions = v3_req
    distinguished_name = req_distinguished_name
    
    [req_distinguished_name]
    
    [ v3_req ]
    basicConstraints = CA:FALSE
    keyUsage = nonRepudiation, digitalSignature, keyEncipherment
    subjectAltName = @alt_names
    
    [ ssl_client ]
    extendedKeyUsage = clientAuth, serverAuth
    basicConstraints = CA:FALSE
    subjectKeyIdentifier=hash
    authorityKeyIdentifier=keyid,issuer
    subjectAltName = @alt_names
    
    [ v3_ca ]
    basicConstraints = CA:TRUE
    keyUsage = nonRepudiation, digitalSignature, keyEncipherment
    subjectAltName = @alt_names
    authorityKeyIdentifier=keyid:always,issuer
    
    [alt_names]
    DNS.1 = localhost
    DNS.2 = etcd.kube-system.svc.cluster.local
    DNS.3 = etcd.kube-system.svc
    DNS.4 = etcd.kube-system
    DNS.5 = etcd
    DNS.6 = xxx
    IP.1 = 127.0.0.1
    IP.2 = x.x.x.x
    

    The hostnames and new IP addresses of all deployed Etcd nodes need to be included.
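
    For example, for a three-node Etcd cluster whose hosts are node1, node2 and node3 with the new addresses used later in this article (an illustrative sketch; substitute the values recorded from the old certificate), the [alt_names] section would be:

    [alt_names]
    DNS.1 = localhost
    DNS.2 = etcd.kube-system.svc.cluster.local
    DNS.3 = etcd.kube-system.svc
    DNS.4 = etcd.kube-system
    DNS.5 = etcd
    DNS.6 = node1
    DNS.7 = node2
    DNS.8 = node3
    IP.1 = 127.0.0.1
    IP.2 = x.x.10.1
    IP.3 = x.x.10.2
    IP.4 = x.x.10.3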

  • Generate the Etcd CA certificate on one Etcd node

    cd /etc/ssl/etcd/ssl
    openssl genrsa -out ca-key.pem 2048
    openssl req -x509 -new -nodes -key ca-key.pem -days 3650 -out ca.pem -subj "/CN=etcd-ca"
    
  • Generate an Etcd admin certificate for each node (run on one Etcd node).

    Generate a certificate for each node by setting a different environment variable, e.g. export host=node1. Here node1 is the hostname; keep it the same as before, otherwise the certificate will not be found because of the name change.

    openssl genrsa -out admin-${host}-key.pem 2048
    openssl req -new -key admin-${host}-key.pem -out admin-${host}.csr -subj "/CN=etcd-admin-${host}"
    openssl x509 -req -in admin-${host}.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out admin-${host}.pem -days 3650 -extensions ssl_client  -extfile openssl.conf
    
  • Generate an Etcd member certificate for each node (run on the same Etcd node).

    Switch the target node with export host=node1 (then node2, node3, ...) and generate a certificate for each node; a combined loop is sketched after this step.

    openssl genrsa -out member-${host}-key.pem 2048
    openssl req -new -key member-${host}-key.pem -out member-${host}.csr -subj "/CN=etcd-member-${host}" -config openssl.conf
    openssl x509 -req -in member-${host}.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out member-${host}.pem -days 3650 -extensions ssl_client -extfile openssl.conf
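
    A combined loop over the admin and member steps (a sketch, assuming the three hosts are node1, node2 and node3 and the working directory is /etc/ssl/etcd/ssl):

    for host in node1 node2 node3; do
        # admin certificate for ${host}
        openssl genrsa -out admin-${host}-key.pem 2048
        openssl req -new -key admin-${host}-key.pem -out admin-${host}.csr -subj "/CN=etcd-admin-${host}"
        openssl x509 -req -in admin-${host}.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out admin-${host}.pem -days 3650 -extensions ssl_client -extfile openssl.conf
        # member certificate for ${host}
        openssl genrsa -out member-${host}-key.pem 2048
        openssl req -new -key member-${host}-key.pem -out member-${host}.csr -subj "/CN=etcd-member-${host}" -config openssl.conf
        openssl x509 -req -in member-${host}.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out member-${host}.pem -days 3650 -extensions ssl_client -extfile openssl.conf
    done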
    
  • Distribute the generated certificates to each Etcd node

    The certificates under /etc/ssl/etcd/ssl/ need to be distributed to each Etcd node.
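
    For example (a sketch, assuming the other Etcd nodes are reachable over SSH as node2 and node3):

    for h in node2 node3; do
        scp /etc/ssl/etcd/ssl/* ${h}:/etc/ssl/etcd/ssl/
    done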

  • View etcd configuration in an Etcd node

    Here etcd is started as a binary, so the location of the etcd configuration file can be found in the systemd unit.

    cat /etc/systemd/system/etcd.service
    
    ...
    EnvironmentFile=/etc/etcd.env
    
  • Replace the IPs on each Etcd node

    Since there are multiple Etcd nodes, several pairs of old and new IPs need to be replaced; here is an example with three nodes.

    export oldip1=x.x.x.1 
    export newip1=x.x.10.1 
    
    export oldip2=x.x.x.2
    export newip2=x.x.10.2 
    
    export oldip3=x.x.x.3 
    export newip3=x.x.10.3 
    
    sed -i "s/$oldip1/$newip1/" /etc/etcd.env
    sed -i "s/$oldip2/$newip2/" /etc/etcd.env
    sed -i "s/$oldip3/$newip3/" /etc/etcd.env
    

    /etc/hosts also needs its IPs replaced, as the hostname is sometimes used in the configuration file.

    sed -i "s/$oldip1/$newip1/" /etc/hosts
    sed -i "s/$oldip2/$newip2/" /etc/hosts
    sed -i "s/$oldip3/$newip3/" /etc/hosts
    

    If you have a regular backup task, you will also need to replace the relevant IP.

    sed -i "s/$oldip1/$newip1/" /usr/local/bin/kube-scripts/etcd-backup.sh
    sed -i "s/$oldip2/$newip2/" /usr/local/bin/kube-scripts/etcd-backup.sh
    sed -i "s/$oldip3/$newip3/" /usr/local/bin/kube-scripts/etcd-backup.sh
    
  • Each Etcd node restores Etcd data from a backup

    This step can be skipped if Etcd is a single node. After the node IPs change, the Etcd cluster is no longer operational: member information is stored in the on-disk data, so modifying the configuration file alone does not help, and a multi-node Etcd has to be recovered from backup data.

    Distribute the Etcd backup file snapshot.db to each Etcd node.

    Execute the following command on each node:

    rm -rf /var/lib/etcd
    
    etcdctl snapshot restore snapshot.db --name etcd-node1 \
            --initial-cluster "etcd-node1=https://x.x.10.1:2380,etcd-node2=https://x.x.10.2:2380,etcd-node3=https://x.x.10.3:2380" \
            --initial-cluster-token k8s_etcd \
            --initial-advertise-peer-urls https://x.x.10.1:2380 \
            --data-dir=/var/lib/etcd
    

    Note that the --name value (etcd-node1 here) and the --initial-advertise-peer-urls parameter vary on each node.
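
    For example, on the second node the restore would look like this (same node names and new IPs as above):

    etcdctl snapshot restore snapshot.db --name etcd-node2 \
            --initial-cluster "etcd-node1=https://x.x.10.1:2380,etcd-node2=https://x.x.10.2:2380,etcd-node3=https://x.x.10.3:2380" \
            --initial-cluster-token k8s_etcd \
            --initial-advertise-peer-urls https://x.x.10.2:2380 \
            --data-dir=/var/lib/etcd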

  • Restart etcd for each Etcd node

    systemctl restart etcd
    
  • View etcd status per Etcd node

    systemctl status etcd
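
    Beyond the systemd status, cluster health can also be checked with etcdctl (a sketch, assuming the certificate paths below; adjust the endpoints and file names to your nodes, and note that older etcdctl releases may need ETCDCTL_API=3):

    etcdctl --endpoints=https://x.x.10.1:2379,https://x.x.10.2:2379,https://x.x.10.3:2379 \
            --cacert=/etc/ssl/etcd/ssl/ca.pem \
            --cert=/etc/ssl/etcd/ssl/admin-node1.pem \
            --key=/etc/ssl/etcd/ssl/admin-node1-key.pem \
            endpoint health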
    

2. Update K8s certificate

  • Back up the certificates

    cp -R /etc/kubernetes/ /etc/kubernetes-bak

  • Replace the IP addresses in the associated files on each Kubernetes node

    # master
    export oldip1=x.x.x.1 
    export newip1=x.x.10.1 
    
    export oldip2=x.x.x.2
    export newip2=x.x.10.2 
    
    export oldip3=x.x.x.3 
    export newip3=x.x.10.3 
    
    # node
    export oldip4=x.x.x.4
    export newip4=x.x.10.4
    
    find /etc/kubernetes -type f | xargs sed -i "s/$oldip1/$newip1/"
    find /etc/kubernetes -type f | xargs sed -i "s/$oldip2/$newip2/"
    find /etc/kubernetes -type f | xargs sed -i "s/$oldip3/$newip3/"
    find /etc/kubernetes -type f | xargs sed -i "s/$oldip4/$newip4/"
    
    sed -i "s/$oldip1/$newip1/" /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    sed -i "s/$oldip2/$newip2/" /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    sed -i "s/$oldip3/$newip3/" /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    sed -i "s/$oldip4/$newip4/" /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    
    sed -i "s/$oldip1/$newip1/" /etc/kubernetes/kubeadm-config.yaml
    sed -i "s/$oldip2/$newip2/" /etc/kubernetes/kubeadm-config.yaml
    sed -i "s/$oldip3/$newip3/" /etc/kubernetes/kubeadm-config.yaml
    sed -i "s/$oldip4/$newip4/" /etc/kubernetes/kubeadm-config.yaml
    
    sed -i "s/$oldip1/$newip1/" /etc/hosts
    sed -i "s/$oldip2/$newip2/" /etc/hosts
    sed -i "s/$oldip3/$newip3/" /etc/hosts
    sed -i "s/$oldip4/$newip4/" /etc/hosts
    
  • Generate new certificates on a master node

    rm -f /etc/kubernetes/pki/apiserver*
    
    kubeadm init phase certs all --config /etc/kubernetes/kubeadm-config.yaml
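
    The regenerated apiserver certificate can then be inspected to confirm the new IPs are present:

    openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'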
    
  • Distribute the generated certificates to each Kubernetes node

    The worker node does not need the key, only the crt.
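
    For example (a sketch, assuming the certificates were regenerated on node1, the other masters are node2 and node3, and node4 is the worker):

    # other master nodes get the full pki directory
    for h in node2 node3; do
        scp -r /etc/kubernetes/pki ${h}:/etc/kubernetes/
    done
    # the worker node only needs the certificate (e.g. ca.crt), not the private keys
    scp /etc/kubernetes/pki/ca.crt node4:/etc/kubernetes/pki/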
    

3. Update the conf files of the cluster components

  • Generate a new configuration file in a master node

    cd /etc/kubernetes
    rm -f admin.conf kubelet.conf controller-manager.conf scheduler.conf
    
    kubeadm init phase kubeconfig all --config /etc/kubernetes/kubeadm-config.yaml
    
  • Distribute the new configuration files to each Kubernetes node

    Each node needs /etc/kubernetes/kubelet.conf and each master node needs /etc/kubernetes/controller-manager.conf and /etc/kubernetes/scheduler.conf.
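
    For example (a sketch, again assuming the files were generated on node1, node2/node3 are the other masters and node4 is the worker):

    for h in node2 node3; do
        scp /etc/kubernetes/{kubelet.conf,controller-manager.conf,scheduler.conf} ${h}:/etc/kubernetes/
    done
    scp /etc/kubernetes/kubelet.conf node4:/etc/kubernetes/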

  • Configure user access credentials on the node that needs to use kubectl

    mkdir -p $HOME/.kube
    cp /etc/kubernetes/admin.conf $HOME/.kube/config
    
  • Restart the kubelet for each Kubernetes node

    systemctl daemon-reload
    systemctl restart kubelet
    
  • View kubelet status per Kubernetes node

    systemctl status kubelet
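
    Once the kubelets are back up, the cluster can be verified from any node with kubectl configured:

    kubectl get nodes -o wide
    kubectl -n kube-system get pods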
    

4. Fix ConfigMap

  • Replace IP

    kubectl -n kube-system edit cm kube-proxy
    

    The kube-proxy configuration affects node communication, so the old apiserver IP in it needs to be replaced. If you use an LB or a domain name as the apiserver entry point, this step can be skipped. As for kubeadm-config, it was already replaced automatically in the steps above, so no additional processing is needed.
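
    After editing the ConfigMap, the kube-proxy pods need to pick up the new configuration; one way is to restart them (a sketch, assuming kubeadm's default kube-proxy DaemonSet):

    kubectl -n kube-system rollout restart daemonset kube-proxy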

5. Summary

It is strongly recommended not to change the IP addresses of cluster hosts. If a host IP change is planned in advance, it is better to rebuild the cluster via backup and restore.

If it is an unintended host IP change, it is recommended to fix it in the above order:

  1. Etcd
  2. K8s certificate
  3. Core components on the K8s master and worker nodes
  4. Cluster ConfigMap configuration

The above documents the repair process. In practice, things get messy when repairing multiple master nodes: containers kept restarting and reporting port conflicts, and at one point I also rebooted a machine. The documentation of the process may not be perfect, but if you follow the sequence and repair one component at a time, it should not be a big problem.