Image I'm using:
bottlerocket-aws-k8s-1.35-x86_64-v1.55.0-d93bb1b1
What I expected to happen:
Misconfigured user data would prevent Bottlerocket from joining the cluster.
What actually happened:
- Launch Bottlerocket with the following user data:
settings.kubernetes.cluster-name = 'eks-cluster-debug'
settings.kubernetes.api-server = 'https://EXAMPLE.us-east-1.eks.amazonaws.com'
settings.kubernetes.cluster-certificate = 'EXAMPLE'
settings.kubernetes.cluster-dns-ip = 'EXAMPLE'
settings.kubernetes.max-pods = 110
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup' = 'bottlerocket-1'
# ... (OMITTED)
settings.aws.region = "us-east-2" # <------------ Key to the issue.
- settings.aws.region in the user data is misconfigured (it does not match the region where the cluster is actually located).
- The Bottlerocket node does not join the cluster, and it is not possible to SSH/SSM into the node for troubleshooting.
- After taking a volume snapshot and mounting it on another node, I checked the journal log. Startup is blocked by the failing pluto.service:
[root@ip-192-168-70-100 data-backup]# journalctl --file ./var/log/journal/ec278fedcd262b537e7eee9392040982/system.journal | grep pluto
Mar 05 05:03:58 localhost pluto[1534]: Timed out retrieving private DNS name from EC2: deadline has elapsed
Mar 05 05:03:58 localhost systemd[1]: pluto.service: Main process exited, code=exited, status=1/FAILURE
Mar 05 05:03:58 localhost systemd[1]: pluto.service: Failed with result 'exit-code'.
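For illustration, the mismatch in the user data above (api-server in us-east-1, settings.aws.region set to us-east-2) could be caught before launch with a small check like the one below. This is only a sketch, not part of Bottlerocket: the function names `region_from_eks_endpoint` and `check_region` are hypothetical, and it assumes the API server host ends in `<region>.eks.amazonaws.com`, as in the example endpoint above.

```rust
/// Extract the region label from an EKS API server URL such as
/// `https://EXAMPLE.us-east-1.eks.amazonaws.com`.
/// Assumption: the host ends in `<region>.eks.amazonaws.com`.
/// Returns `None` for any other shape.
fn region_from_eks_endpoint(api_server: &str) -> Option<&str> {
    // Drop the scheme if present.
    let host = api_server.strip_prefix("https://").unwrap_or(api_server);
    // Drop any port or path that follows the host.
    let host = host.split(|c| c == '/' || c == ':').next()?;
    // Everything before the fixed EKS suffix; the region is its last label.
    let rest = host.strip_suffix(".eks.amazonaws.com")?;
    let region = rest.rsplit('.').next()?;
    if region.is_empty() {
        None
    } else {
        Some(region)
    }
}

/// Compare the region implied by the API server URL with the value
/// configured in `settings.aws.region`. An endpoint that cannot be
/// parsed is not treated as an error, because nothing can be concluded.
fn check_region(api_server: &str, configured_region: &str) -> Result<(), String> {
    match region_from_eks_endpoint(api_server) {
        Some(endpoint_region) if endpoint_region != configured_region => Err(format!(
            "settings.aws.region is '{}' but the API server is in '{}'",
            configured_region, endpoint_region
        )),
        _ => Ok(()),
    }
}
```

With the values from this report, `check_region("https://EXAMPLE.us-east-1.eks.amazonaws.com", "us-east-2")` returns an `Err` naming both regions.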
How to reproduce the problem:
See the steps above.
--= Edit =--
After a deep dive, I can see that pluto plays a key role in generating the Kubernetes-related settings during Bottlerocket startup. It interacts with IMDS, the AWS EC2 API, and the Kubernetes API, as described on its GitHub page:
https://github.com/bottlerocket-os/bottlerocket-core-kit/tree/develop/sources/api/pluto
In the process, pluto retrieves the current EC2 instance ID through IMDS when generate_node_name() is called:
https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/pluto/src/main.rs#L464-L470
Next, it passes the EC2 instance ID to get_private_dn_name(region, &instance_id,...) to determine the hostname of the node:
https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/pluto/src/main.rs#L473-L483
get_private_dn_name() retries until the default timeout of 5 minutes elapses, which produces the error shown in the journal log above:
https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/pluto/src/ec2.rs#L43-L87
Here you can see that the key factors affecting the success or failure of get_private_dn_name() are the following parameters:
region, instance_id, http_proxy, no_proxy
If this step fails, the node may not be able to join the cluster.
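To show why a wrong region surfaces as a timeout rather than an immediate failure, here is a minimal sketch of a retry-until-deadline loop. It is not pluto's actual code (see the linked ec2.rs for that): the name `retry_until_deadline` is hypothetical, the loop is synchronous, and the error string simply mirrors the journal message.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Call `attempt` repeatedly until it succeeds or `timeout` elapses.
/// Individual failures are discarded; only the final timeout is reported,
/// which is why the underlying cause (for example, a lookup sent to the
/// wrong region) does not appear in the resulting error.
fn retry_until_deadline<T, E>(
    timeout: Duration,
    delay: Duration,
    mut attempt: impl FnMut() -> Result<T, E>,
) -> Result<T, String> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(value) = attempt() {
            return Ok(value);
        }
        // Give up if another sleep would carry us past the deadline.
        if Instant::now() + delay >= deadline {
            return Err("deadline has elapsed".to_string());
        }
        sleep(delay);
    }
}
```

If the lookup can never succeed because the instance does not exist in the configured region, every attempt fails and the caller only sees the timeout after the full wait.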