Misconfigured user-data prevents Bottlerocket from joining the cluster #4779

@guessi

Image I'm using:

  • bottlerocket-aws-k8s-1.35-x86_64-v1.55.0-d93bb1b1

What I expected to happen:

Misconfigured user-data prevents Bottlerocket from joining the cluster.

What actually happened:

  1. Launch Bottlerocket with the following user-data:
settings.kubernetes.cluster-name = 'eks-cluster-debug'
settings.kubernetes.api-server = 'https://EXAMPLE.us-east-1.eks.amazonaws.com'
settings.kubernetes.cluster-certificate = 'EXAMPLE'
settings.kubernetes.cluster-dns-ip = 'EXAMPLE'
settings.kubernetes.max-pods = 110
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup' = 'bottlerocket-1'
# ... (OMITTED)

settings.aws.region = "us-east-2" # <------------ Key to the issue.
  2. Note that settings.aws.region in the user-data is misconfigured: it does not match the region where the cluster is actually located.

  3. The Bottlerocket node does not join the cluster, and it is not possible to SSH/SSM into the node for troubleshooting.

  4. After taking a snapshot of the volume and mounting it on another node, the journal log shows that startup is blocked by a failing pluto.service:

[root@ip-192-168-70-100 data-backup]# journalctl --file ./var/log/journal/ec278fedcd262b537e7eee9392040982/system.journal | grep pluto
Mar 05 05:03:58 localhost pluto[1534]: Timed out retrieving private DNS name from EC2: deadline has elapsed
Mar 05 05:03:58 localhost systemd[1]: pluto.service: Main process exited, code=exited, status=1/FAILURE
Mar 05 05:03:58 localhost systemd[1]: pluto.service: Failed with result 'exit-code'.

How to reproduce the problem:

See the steps above.

--= Edit =--

After a deeper dive, I can see that pluto plays a key role in generating the Kubernetes-related settings during Bottlerocket startup. It interacts with IMDS, the AWS EC2 API, and the Kubernetes API, as described on its GitHub page:

https://github.com/bottlerocket-os/bottlerocket-core-kit/tree/develop/sources/api/pluto

During this process, pluto retrieves the current EC2 instance ID through IMDS when generate_node_name() is called:

https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/pluto/src/main.rs#L464-L470

Next, it passes the EC2 instance ID to get_private_dns_name(region, &instance_id,...) to determine the hostname of the node:

https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/pluto/src/main.rs#L473-L483

Of these, get_private_dns_name() retries until its default timeout of 5 minutes elapses, which produces the timeout error observed in the journal above:

https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/pluto/src/ec2.rs#L43-L87
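To illustrate why a wrong region surfaces as a timeout rather than an immediate error, here is a minimal Python sketch of a retry-until-deadline loop. It is not pluto's actual code (pluto is written in Rust). The function name `retry_until_deadline`, the 5-second interval, and the choice to treat every failed attempt as retryable are assumptions made for illustration; only the 5-minute default comes from the description above.

```python
import time
from typing import Callable, Optional


def retry_until_deadline(
    lookup: Callable[[], Optional[str]],
    timeout_s: float = 300.0,  # the 5-minute default described above
    interval_s: float = 5.0,   # hypothetical retry interval (assumption)
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `lookup` repeatedly until it returns a name or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        try:
            name = lookup()
        except Exception:
            # Assumption: any error (for example, the instance not being
            # found in the queried region) is treated as "try again later".
            name = None
        if name:
            return name
        # Give up if another wait would carry us past the deadline.
        if clock() + interval_s > deadline:
            raise TimeoutError("deadline has elapsed")
        sleep(interval_s)
```

Under these assumptions, a lookup sent to the wrong region never succeeds, so the caller only sees the failure after the whole timeout has been spent, which is consistent with the single "deadline has elapsed" message in the journal.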

Here you can see that the success or failure of get_private_dns_name() depends mainly on the following parameters:

region, instance_id, http_proxy, no_proxy

If this step fails, the node may not be able to join the cluster.
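Since the node cannot be reached once it is in this state, it may help to catch the mismatch before launch. The following Python sketch compares settings.aws.region in the user-data against the region embedded in settings.kubernetes.api-server. The helper names `region_from_eks_endpoint` and `check_region` are hypothetical, and the check assumes the endpoint hostname ends in `<region>.eks.amazonaws.com`, as in the user-data above; other endpoint shapes are simply skipped.

```python
from typing import Optional
from urllib.parse import urlparse

try:
    import tomllib  # standard library in Python 3.11+
except ImportError:  # older interpreters: fall back to the tomli backport
    import tomli as tomllib


def region_from_eks_endpoint(api_server: str) -> Optional[str]:
    """Return the region label of an EKS endpoint hostname, or None."""
    host = urlparse(api_server).hostname or ""
    labels = host.split(".")
    # Assumed hostname shape: <id>[.<suffix>].<region>.eks.amazonaws.com
    if len(labels) >= 4 and labels[-3:] == ["eks", "amazonaws", "com"]:
        return labels[-4]
    return None


def check_region(user_data: str) -> Optional[str]:
    """Return a warning message if the two regions disagree, else None."""
    settings = tomllib.loads(user_data).get("settings", {})
    configured = settings.get("aws", {}).get("region")
    api_server = settings.get("kubernetes", {}).get("api-server", "")
    expected = region_from_eks_endpoint(api_server)
    if configured and expected and configured != expected:
        return (
            f"settings.aws.region is {configured!r} "
            f"but api-server is in {expected!r}"
        )
    return None
```

Running `check_region` on the user-data from this report would flag `us-east-2` against `us-east-1`. The check is only a heuristic: it says nothing about the proxy settings, which also influence the lookup.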


Labels: status/needs-triage (pending triage or re-evaluation), type/bug (something isn't working)
