Misconfigured user data prevents Bottlerocket from joining the cluster.

**Image I'm using:**

* `bottlerocket-aws-k8s-1.35-x86_64-v1.55.0-d93bb1b1`

**What I expected to happen:**

Misconfigured user data prevents Bottlerocket from joining the cluster.

**What actually happened:**

1. Launch Bottlerocket with the following user data:

```toml
settings.kubernetes.cluster-name = 'eks-cluster-debug'
settings.kubernetes.api-server = 'https://EXAMPLE.us-east-1.eks.amazonaws.com'
settings.kubernetes.cluster-certificate = 'EXAMPLE'
settings.kubernetes.cluster-dns-ip = 'EXAMPLE'
settings.kubernetes.max-pods = 110
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup' = 'bottlerocket-1'
# ... (OMITTED)

settings.aws.region = "us-east-2" # <------------ Key to the issue.
```

2. The [settings.aws.region](https://bottlerocket.dev/en/os/1.54.x/api/settings/aws/#region) value in the user data is misconfigured: it does not match the region where the cluster is actually located.

3. The Bottlerocket node does not join the cluster, and it cannot be reached over SSH or SSM for troubleshooting.

4. After taking a volume snapshot and mounting it on another node, check the journal log. Startup is blocked by `pluto.service`:

```bash
[root@ip-192-168-70-100 data-backup]# journalctl --file ./var/log/journal/ec278fedcd262b537e7eee9392040982/system.journal | grep pluto
Mar 05 05:03:58 localhost pluto[1534]: Timed out retrieving private DNS name from EC2: deadline has elapsed
Mar 05 05:03:58 localhost systemd[1]: pluto.service: Main process exited, code=exited, status=1/FAILURE
Mar 05 05:03:58 localhost systemd[1]: pluto.service: Failed with result 'exit-code'.
```
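
The mismatch in step 2 could in principle be caught before launch by comparing `settings.aws.region` with the region embedded in `settings.kubernetes.api-server`. The following is a minimal sketch, not part of Bottlerocket: the helper names `region_from_eks_endpoint` and `region_matches` are hypothetical, and the URL shape `https://<id>.<region>.eks.amazonaws.com` is an assumption taken from the example above (other partitions or endpoint formats would need extra handling).

```rust
/// Extract the region label from an EKS API server URL.
/// Assumes the host ends in `.eks.amazonaws.com` and that the label
/// immediately before that suffix is the region (hypothetical helper).
fn region_from_eks_endpoint(api_server: &str) -> Option<&str> {
    // Drop the scheme and anything after the host.
    let host = api_server.strip_prefix("https://")?.split('/').next()?;
    let prefix = host.strip_suffix(".eks.amazonaws.com")?;
    // The region is the last dot-separated label before the suffix.
    let region = prefix.rsplit('.').next()?;
    if region.is_empty() {
        None
    } else {
        Some(region)
    }
}

/// Compare the configured region with the one found in the endpoint.
/// Returns `None` when the endpoint does not match the assumed format.
fn region_matches(api_server: &str, configured_region: &str) -> Option<bool> {
    region_from_eks_endpoint(api_server).map(|region| region == configured_region)
}
```

With the values from this report, `region_matches("https://EXAMPLE.us-east-1.eks.amazonaws.com", "us-east-2")` yields `Some(false)`, which flags the misconfiguration without contacting any AWS API.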

**How to reproduce the problem:**

See the steps above.


***--= Edit =--***

After a deep dive, I can see that `pluto` plays a key role in generating the Kubernetes-related settings during Bottlerocket startup. It interacts with IMDS, the AWS EC2 API, and the Kubernetes API, as described on its GitHub page:

    https://github.com/bottlerocket-os/bottlerocket-core-kit/tree/develop/sources/api/pluto

During this process, `pluto` retrieves the current EC2 instance ID from IMDS when `generate_node_name()` is called:

    https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/pluto/src/main.rs#L464-L470

Next, it passes the EC2 instance ID to `get_private_dns_name(region, &instance_id, ...)` to determine the node's `hostname`:

    https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/pluto/src/main.rs#L473-L483

`get_private_dns_name()` retries until the default timeout of `5 minutes` elapses, which would explain the timeout error in the journal above:

    https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/pluto/src/ec2.rs#L43-L87

The success or failure of `get_private_dns_name()` depends on the following parameters:

    region, instance_id, http_proxy, no_proxy

If this step fails, the node may not be able to join the cluster.
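
The retry-until-timeout behaviour described above can be illustrated with a small, synchronous sketch. This is not the `pluto` implementation (see the linked `ec2.rs` for that); `retry_until_deadline` and its parameters are hypothetical names chosen for illustration. It shows why a lookup that can never succeed, presumably the case when the instance ID is queried in the wrong region, only surfaces as an error once the full timeout has elapsed.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Retry `op` until it succeeds or `timeout` elapses, sleeping `delay`
/// between attempts. Returns the last error once the deadline has passed.
/// (Hypothetical helper, not taken from the pluto source.)
fn retry_until_deadline<T, E>(
    timeout: Duration,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let deadline = Instant::now() + timeout;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Give up once the deadline has passed.
            Err(err) if Instant::now() >= deadline => return Err(err),
            // Otherwise wait and try again.
            Err(_) => sleep(delay),
        }
    }
}
```

A transient failure is absorbed by the loop, while a permanent one blocks the caller for the whole timeout before reporting anything. That matches the observation that the node stays unreachable for several minutes before `pluto.service` exits with a failure.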
