Fixing custom domain name resolution failures in CoreDNS

A few days ago we used NodeLocal DNSCache to solve the 5-second timeout problem with CoreDNS, and cluster DNS resolution performance improved significantly. Today, however, we hit a major pitfall. While running DevOps experiments, our tools used custom domain names, so we needed custom name resolution for them to reach each other. We could solve this by adding hostAliases to Pods, but Jenkins’ Kubernetes plugin does not support this parameter directly; it has to be defined through YAML, which is a bit of a pain. So we decided to add an A record via CoreDNS instead.

Normally we just need to add the hosts plugin to the ConfigMap of CoreDNS and it will work.

hosts {
  10.151.30.11 git.k8s.local
  fallthrough
}
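To make the role of fallthrough concrete, here is a minimal Python sketch of how a hosts-style lookup behaves: a name found in the table is answered directly, and a name that is not found is handed to the next plugin only when fallthrough is set. This is an illustration, not CoreDNS code; hosts_lookup, cluster_records and the 10.96.0.1 address are hypothetical names and values invented for the example.

```python
from typing import Callable, Dict, Optional

# The single A record from the hosts block above.
HOSTS: Dict[str, str] = {"git.k8s.local": "10.151.30.11"}


def hosts_lookup(name: str,
                 fallthrough: bool,
                 next_plugin: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return an address for name, or None when no address is produced."""
    key = name.rstrip(".").lower()
    if key in HOSTS:
        return HOSTS[key]          # answered from the hosts table
    if fallthrough:
        return next_plugin(key)    # hand the query to the next plugin
    return None                    # not passed on, no address returned


def cluster_records(name: str) -> Optional[str]:
    """Hypothetical stand-in for the plugins that follow hosts."""
    return {"kubernetes.default.svc.cluster.local": "10.96.0.1"}.get(name)


print(hosts_lookup("git.k8s.local", True, cluster_records))
print(hosts_lookup("kubernetes.default.svc.cluster.local", True, cluster_records))
print(hosts_lookup("kubernetes.default.svc.cluster.local", False, cluster_records))
```

Without fallthrough in this sketch, the cluster name in the last call never reaches the stand-in for the later plugins, which is why the hosts block above includes it.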

However, after the configuration is complete, the custom domain name never resolves.

$ kubectl run -it --image busybox:1.28.4 test --restart=Never --rm /bin/sh
If you don't see a command prompt, try pressing enter.
/ # nslookup git.k8s.local
Server:    169.254.20.10
Address 1: 169.254.20.10

nslookup: can't resolve 'git.k8s.local'

This is a bit strange: isn’t this how the hosts plugin is supposed to work? After some checking, I was convinced that the configuration was correct. I then turned on CoreDNS logging and filtered the resolution logs for the domain name above.

The logs showed the query walking through each search domain without ever returning a correct result, which was puzzling. After some more digging, it occurred to me that NodeLocal DNSCache is enabled in the cluster. Could that be the cause? Isn’t it supposed to forward queries to CoreDNS on a cache miss?

To verify this, let’s test resolution directly against the CoreDNS address, bypassing NodeLocal DNSCache.

/ # nslookup git.k8s.local 10.96.0.10
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      git.k8s.local
Address 1: 10.151.30.11 git.k8s.local

The result is correct, which means there is nothing wrong with the CoreDNS configuration and the problem must be caused by NodeLocal DNSCache. Querying the LocalDNS address (169.254.20.10) directly does indeed fail:

/ # nslookup git.k8s.local 169.254.20.10
Server:    169.254.20.10
Address 1: 169.254.20.10

nslookup: can't resolve 'git.k8s.local'
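When repeating this comparison across several resolvers, it can help to reduce each nslookup transcript to the addresses it actually returned. The following is a minimal Python sketch that assumes busybox-style output exactly as shown above; resolved_addresses is a hypothetical helper, and the two sample strings are copied from the transcripts in this post.

```python
import re
from typing import List


def resolved_addresses(nslookup_output: str) -> List[str]:
    """Extract the answer addresses from busybox-style nslookup output."""
    addresses: List[str] = []
    in_answer = False
    for line in nslookup_output.splitlines():
        if line.startswith("Name:"):
            in_answer = True  # lines after "Name:" belong to the answer
        elif in_answer:
            match = re.match(r"Address\s+\d+:\s+(\S+)", line)
            if match:
                addresses.append(match.group(1))
    return addresses


VIA_COREDNS = """Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      git.k8s.local
Address 1: 10.151.30.11 git.k8s.local
"""

VIA_LOCALDNS = """Server:    169.254.20.10
Address 1: 169.254.20.10

nslookup: can't resolve 'git.k8s.local'
"""

print("CoreDNS:", resolved_addresses(VIA_COREDNS))
print("LocalDNS:", resolved_addresses(VIA_LOCALDNS))
```

An empty list for one resolver and a non-empty list for the other points at the layer in between, which here is NodeLocal DNSCache.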

At this point it’s time to look at the LocalDNS Pod logs:

$ kubectl logs -f node-local-dns-bb84m -n kube-system
......
2020/05/14 05:30:21 [INFO] Updated Corefile with 0 custom stubdomains and upstream servers /etc/resolv.conf
2020/05/14 05:30:21 [INFO] Using config file:
cluster.local:53 {
    errors
    cache {
            success 9984 30
            denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 10.96.0.10
    forward . 10.96.207.156 {
            force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
    }
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 10.96.0.10
    forward . 10.96.207.156 {
            force_tcp
    }
    prometheus :9253
    }
ip6.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 10.96.0.10
    forward . 10.96.207.156 {
            force_tcp
    }
    prometheus :9253
    }
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 10.96.0.10
    forward . /etc/resolv.conf {
            force_tcp
    }
    prometheus :9253
    }
......
[INFO] plugin/reload: Running configuration MD5 = 3e3833f9361872f1d34bc97155f952ca
CoreDNS-1.6.7
linux/amd64, go1.11.13,

Analyzing the LocalDNS configuration above: 10.96.0.10 is the Service ClusterIP of CoreDNS, 169.254.20.10 is the IP address of LocalDNS, and 10.96.207.156 is the ClusterIP of a new Service created by LocalDNS. This Service is backed by the same list of Endpoints as CoreDNS.

A closer look reveals that cluster.local, in-addr.arpa and ip6.arpa are forwarded to 10.96.207.156, i.e. to CoreDNS for resolution, while everything else is handled by forward . /etc/resolv.conf, that is, sent to the nameservers in the resolv.conf file, which reads as follows.

nameserver 169.254.20.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
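To see where each query ends up, the following Python sketch combines the search list and ndots value from this resolv.conf with the zone-to-forward mapping from the LocalDNS log. It assumes glibc-style search expansion and picks the most specific matching zone; candidate_names and matching_zone are hypothetical helpers written for illustration, not code from CoreDNS or any resolver library.

```python
from typing import Dict, Iterable, List

SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5

# Zone -> forward target, copied from the node-local-dns log output.
FORWARDS: Dict[str, str] = {
    "cluster.local": "10.96.207.156",
    "in-addr.arpa": "10.96.207.156",
    "ip6.arpa": "10.96.207.156",
    ".": "/etc/resolv.conf",
}


def candidate_names(name: str, search: List[str], ndots: int) -> List[str]:
    """List the names a resolver tries, in order (glibc-style assumption)."""
    if name.endswith("."):
        return [name.rstrip(".")]          # fully qualified: no search list
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded           # enough dots: try as-is first
    return expanded + [name]               # too few dots: search list first


def matching_zone(qname: str, zones: Iterable[str]) -> str:
    """Pick the most specific zone that qname falls under, else the root."""
    qname = qname.rstrip(".").lower()
    matches = [z for z in zones
               if z != "." and (qname == z or qname.endswith("." + z))]
    return max(matches, key=len, default=".")


for candidate in candidate_names("git.k8s.local", SEARCH, NDOTS):
    zone = matching_zone(candidate, FORWARDS)
    print(f"{candidate} -> zone {zone} -> forward {FORWARDS[zone]}")
```

Under these assumptions, the three search-expanded names go to CoreDNS via the cluster.local zone, while the bare name git.k8s.local is the only candidate that lands in the . zone and is sent to /etc/resolv.conf.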

So when we resolve git.k8s.local, the resolver first walks through the search domains. Those expanded names end in cluster.local and are forwarded straight to CoreDNS, which naturally has no records for them. The bare name git.k8s.local falls into the . zone, which is not forwarded to CoreDNS, so the hosts entry we added there is never consulted. So isn’t it natural to think that we can just configure the hosts plugin on the LocalDNS side? It seems like exactly the right idea:

$ kubectl edit cm node-local-dns -n kube-system
......
.:53 {
    errors
    hosts {  # add an A record
      10.151.30.11 git.k8s.local
      fallthrough
    }
    cache 30
    reload
    loop
    bind 169.254.20.10 10.96.0.10
    forward . __PILLAR__UPSTREAM__SERVERS__ {
            force_tcp
    }
    prometheus :9253
}
......

After the update is complete, we manually rebuild the NodeLocalDNS Pod, only to find that it fails to start with the following error message.

no action found for directive 'hosts' with server type 'dns'

It turns out that the hosts plugin is not supported here at all, so the record has to be resolved by CoreDNS. This time we change forward . /etc/resolv.conf to forward . 10.96.207.156 so that these queries go to CoreDNS, by making the following change in the NodeLocalDNS ConfigMap.

$ kubectl edit cm node-local-dns -n kube-system
......
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 10.96.0.10
    forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
    }
    prometheus :9253
}
......

Once the change is made, the NodeLocalDNS Pods need to be rebuilt for it to take effect.

The __PILLAR__CLUSTER__DNS__ and __PILLAR__UPSTREAM__SERVERS__ parameters are filled in automatically in image version 1.15.6 and above; the corresponding values are derived from the kube-dns ConfigMap and the custom upstream server addresses.
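The effect of this substitution on the edited . zone can be sketched in a few lines of Python. This only illustrates the outcome and is not how the node-local-dns image implements it; render_corefile is a hypothetical helper, and the two values are the ones that appear in the LocalDNS log earlier in this post.

```python
from typing import Dict

# The edited "." server block from the node-local-dns ConfigMap.
TEMPLATE = """.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 10.96.0.10
    forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
    }
    prometheus :9253
}"""

# Values as seen in the running configuration printed in the Pod log.
VALUES: Dict[str, str] = {
    "__PILLAR__CLUSTER__DNS__": "10.96.207.156",
    "__PILLAR__UPSTREAM__SERVERS__": "/etc/resolv.conf",
}


def render_corefile(template: str, values: Dict[str, str]) -> str:
    """Replace every placeholder in the template with its value."""
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


print(render_corefile(TEMPLATE, VALUES))
```

After rendering, the . zone forwards to 10.96.207.156 just like the cluster.local zone, so the bare name git.k8s.local now reaches CoreDNS and its hosts entry.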

Now let’s go back and test that the custom domain name resolves properly.

/ # nslookup git.k8s.local
Server:    169.254.20.10
Address 1: 169.254.20.10

Name:      git.k8s.local
Address 1: 10.151.30.11 git.k8s.local

If you are using NodeLocalDNS, be aware of this issue: if the hosts or rewrite plugins in CoreDNS do not seem to work, this is most likely the cause. The best way to troubleshoot problems is always to analyze them through logs.