Adding New Nodes to Your K3s Cluster¶

This guide outlines the process of adding new nodes to your existing K3s cluster using the AWS CloudShell interface. We'll cover the steps necessary to create and provision an AWS EC2 instance and then connect it to your K3s cluster.

Prerequisites¶

An existing and functional K3s cluster.
AWS account with appropriate permissions.
AWS CloudShell access configured with your AWS credentials.
Familiarity with basic Linux commands and AWS concepts.

Step 1: Set Up Environment Variables in AWS CloudShell¶

To simplify the AWS CLI commands, define several environment variables in your CloudShell session. Create a file named EnvVars containing the variable assignments below, then source the file. Make sure to replace the example values with your actual configuration. For example:

# AWS Information
export REGION="us-east-2"                   # AWS Default Region
export CIDR_VPC="10.0.0.0/24"               # VPC CIDR
export CIDR_SUBNET="10.0.0.0/24"            # VPC Subnet
export SSH_KEY_NAME="MVAI-SSH-Key"          # SSH Key Pair Name
export SG_NAME="MVAIsg"                     # Security Group Name
export VPC_NAME="MemVergeAI-VPC"            # VPC Name
export SUBNET_NAME="MemVergeAI-Subnet"      # Subnet Name
export RT_NAME="MemVergeAI-RouteTable"      # Routing Table Name
export IG_NAME="MemVergeAI-IGW"             # Internet Gateway Name
export FILE_SYSTEM_NAME="MemVergeAI-EFS"    # EFS File System Name
export VPC_ID="vpc-01bdeafcc0ce883e5"       # VPC ID
export SUBNET_ID="subnet-01f24fb72235228ed" # Subnet ID
export IGW_ID="igw-06ffc82ccff0bf75f"       # Internet Gateway ID
export RT_ID="rtb-05e6d9dcaf649f7ff"        # Routing Table ID
export SG_ID="sg-00dbfae93e065b028"         # Security Group ID
export AMI_ID="ami-0c3b809fcf2445b6a"       # AMI Image ID for Ubuntu 22.04
export FILE_SYSTEM_ID="fs-06089fdf3a7751a5f" # EFS File System ID
export EFS_DNSNAME="fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com" # EFS File System DNS Fully Qualified Name

# GPU Worker Node Info
export INSTANCE_TYPE="g5.2xlarge"
export INSTANCE_NAME="MemVergeAI-GPU-Worker02"

Source the environment variables:

source EnvVars

Verify that the variables are set correctly:

echo $REGION
echo $VPC_ID
echo $SG_ID
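
Echoing a few variables by hand can miss one that was never set. As a minimal sketch, the hypothetical helper below (check_env_vars is not part of the AWS CLI or this guide's tooling) reports every variable in a list that is unset or empty:

```shell
# check_env_vars NAME...: print each variable name that is unset or empty;
# return non-zero if any are missing.
check_env_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable named by $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "MISSING: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the variables used by the later steps in this guide.
# check_env_vars REGION VPC_ID SUBNET_ID SG_ID AMI_ID SSH_KEY_NAME INSTANCE_TYPE INSTANCE_NAME
```

If the function prints nothing and returns zero, all listed variables have values and you can continue.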

Step 2: Create and Configure the EC2 Instance¶

Use the AWS CLI to launch a new EC2 instance with the defined parameters. This command creates a g5.2xlarge instance in your VPC, assigns it to your security group, associates it with your SSH key, and tags it with a name.

WORKER_INSTANCE_ID=$(aws ec2 run-instances \
   --image-id "$AMI_ID" \
   --count 1 \
   --region "$REGION" \
   --instance-type "$INSTANCE_TYPE" \
   --key-name "$SSH_KEY_NAME" \
   --security-group-ids "$SG_ID" \
   --subnet-id "$SUBNET_ID" \
   --associate-public-ip-address \
   --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$INSTANCE_NAME}]" \
   --block-device-mappings "[
       {
          \"DeviceName\": \"/dev/sda1\",
          \"Ebs\": {
             \"VolumeSize\": 60,
             \"VolumeType\": \"gp3\"
          }
       }
    ]" \
   --query 'Instances[0].InstanceId' \
   --output text)

echo "New GPU worker instance ID: $WORKER_INSTANCE_ID"

Explanation:
--image-id: Specifies the AMI to use (Ubuntu 22.04 in this example).
--count 1: Launches one instance.
--instance-type: Sets the instance type (g5.2xlarge in this example).
--key-name: Associates the instance with your SSH key for secure access.
--security-group-ids: Assigns the instance to your existing security group.
--subnet-id: Launches the instance in your specified subnet.
--associate-public-ip-address: Requests a public IP address for the instance.
--tag-specifications: Adds a tag to the instance for identification.
--region: Specifies the AWS region.
--block-device-mappings: Sets the OS boot volume size to 60 GiB and the volume type to gp3.

Get the Instance ID & IP Address¶

The run-instances command above stores the new instance ID in WORKER_INSTANCE_ID and prints it. Use the following command to list the instances in your VPC together with their state and their public and private IP addresses. Alternatively, go to the AWS EC2 console and find your newly created instance.

Example:

aws ec2 describe-instances \
    --filters "Name=vpc-id,Values=$VPC_ID" \
    --query "Reservations[].Instances[].{ID:InstanceId,Name:Tags[?Key=='Name']|[0].Value,State:State.Name,PublicIP:PublicIpAddress,PrivateIP:PrivateIpAddress}" \
    --output table

Example output:

----------------------------------------------------------------------------------------------------
|                                         DescribeInstances                                        |
+---------------------+---------------------------+-------------+----------------+-----------------+
|         ID          |           Name            |  PrivateIP  |   PublicIP     |      State      |
+---------------------+---------------------------+-------------+----------------+-----------------+
|  i-0770b293b7b6383e0|  MemVergeAI-Management01  |  10.0.0.156 |  3.20.192.186  |  running        |
|  i-02a83e5064fccd806|  MemVergeAI-GPU-Worker01  |  10.0.0.9   |  3.128.242.144 |  running        |
|  i-04d3204eab2b5eb6c|  MemVergeAI-GPU-Worker02  |  10.0.0.51  |  18.117.72.15  |  running        |
+---------------------+---------------------------+-------------+----------------+-----------------+

Continue once all instances are in the running state.

Save the public IP address of the new node to the NEW_NODE_IP variable:

export NEW_NODE_IP="18.117.72.15"
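
Instead of copying the address from the table by hand, the lookup can be scripted. The sketch below assumes WORKER_INSTANCE_ID and REGION are still set in your CloudShell session; get_public_ip is a hypothetical helper name, while the aws ec2 wait instance-running and describe-instances subcommands are standard AWS CLI calls:

```shell
# get_public_ip INSTANCE_ID REGION: wait until the instance is running,
# then print its public IPv4 address.
get_public_ip() {
    aws ec2 wait instance-running \
        --instance-ids "$1" --region "$2" || return 1
    aws ec2 describe-instances \
        --instance-ids "$1" --region "$2" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' \
        --output text
}

# Example (requires AWS credentials, so it is left commented out):
# NEW_NODE_IP=$(get_public_ip "$WORKER_INSTANCE_ID" "$REGION")
# export NEW_NODE_IP
```

The waiter blocks until EC2 reports the instance as running, so the address query is not issued before a public IP has been assigned.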

SSH to the new Instance¶

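Connect to the new instance over SSH as the ubuntu user, using your key pair file and the instance's public DNS name or public IP address.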

Example:

ssh -i "MVAI-SSH-Key.pem" ubuntu@ec2-18-117-72-15.us-east-2.compute.amazonaws.com
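
OpenSSH refuses a private key file that is readable by group or others, which is a common cause of a failed first login. As a minimal sketch, the hypothetical helper key_is_private below checks the mode string reported by ls before you connect (the key file name is the example one used in this guide):

```shell
# key_is_private FILE: succeed only if FILE grants no permissions to
# group or others (for example mode 400 or 600).
key_is_private() {
    # Characters 5-10 of the `ls -l` mode string are the group/other bits.
    perms=$(ls -ld "$1" 2>/dev/null | cut -c5-10)
    [ "$perms" = "------" ]
}

# Example: tighten the permissions only when needed.
# key_is_private MVAI-SSH-Key.pem || chmod 400 MVAI-SSH-Key.pem
```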

Renaming AWS EC2 Hostnames (Optional)¶

The default hostnames assigned by AWS are not intuitive for the MemVerge.ai cluster. You can rename your AWS EC2 instances to more descriptive hostnames, such as mvai-nvgpu02, which makes the nodes easier to identify when managing the cluster.

Update the hostname on each instance

SSH into each EC2 instance and run the following command:
```
sudo hostnamectl set-hostname new-hostname 
```
Replace "new-hostname" with your desired hostname (e.g., mvai-mgmt, mvai-nvgpu02).

Update /etc/hosts file

Edit the /etc/hosts file and add a line with the new hostname below the default 127.0.0.1 localhost line:
```
127.0.0.1 new-hostname
```
Update DNS settings (Optional)

If you're using Amazon Route 53 or another DNS service, update the DNS records to reflect the new hostnames.

Reboot the host:
```
sudo systemctl reboot
```
When the system boots, verify the new hostname is correct:
```
hostnamectl
```

Updating /etc/hosts on All Nodes¶

To ensure proper communication between nodes in your cluster, add the hostnames and private IP addresses of all nodes to the /etc/hosts file on each system. This step is required only when you are not using DNS for hostname resolution. If you use DNS, skip this step and make sure your DNS entries are correct.

Gather the private IP address and hostname of each node in your cluster, for example by running ip a and hostname on each node.
SSH into each node (management and worker nodes). The default user for Ubuntu Linux is ubuntu:
```
ssh ubuntu@<node-ip>
```
On each node, edit the /etc/hosts file:
```
sudo vim /etc/hosts
```
Add entries for all nodes in your cluster. The format is:
```
<private-ip> <hostname>
```
For example, add these lines:
```
# MemVerge.ai Cluster IP Addresses and Hostnames
10.0.0.156 mvai-mgmt
10.0.0.9   mvai-nvgpu01
10.0.0.51  mvai-nvgpu02
```
Add an entry for each node in your cluster, including the node you're currently editing.
Save the file and exit the editor.
Repeat the SSH, edit, and save steps above for each node in your cluster.
Verify the changes by pinging other nodes using their hostnames:
```
ping mvai-mgmt
ping mvai-nvgpu01
ping mvai-nvgpu02
```
Ensure that each node can ping all other nodes using their hostnames.

By adding these entries to /etc/hosts on all systems, you ensure that each node can resolve the hostnames of other nodes in the cluster. This is crucial for Kubernetes and other cluster components to communicate properly.

Remember to update the /etc/hosts file on all nodes whenever you add or remove nodes from your cluster. While this manual process works well for smaller, static clusters, using DNS is generally preferred for larger or more dynamic environments.
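
Because these entries must be added on every node and kept in sync, it helps if the edit can be repeated safely. The following is a minimal sketch, not part of the MemVerge.ai tooling: add_host_entry is a hypothetical helper that appends an entry only when the hostname is not already listed, so running it twice does not create duplicates.

```shell
# add_host_entry FILE IP HOSTNAME: append "IP HOSTNAME" to FILE unless a
# non-comment line already lists HOSTNAME. Writing to the real /etc/hosts
# requires root.
add_host_entry() {
    file=$1 ip=$2 host=$3
    # Succeeds when HOSTNAME appears as a whole field after the address.
    if awk -v h="$host" '
            $1 ~ /^#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }
        ' "$file"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$host" >> "$file"
}

# Example, run as root on each node:
# add_host_entry /etc/hosts 10.0.0.156 mvai-mgmt
# add_host_entry /etc/hosts 10.0.0.9   mvai-nvgpu01
# add_host_entry /etc/hosts 10.0.0.51  mvai-nvgpu02
```

Note that the helper does not update an existing entry whose IP address has changed; in that case edit the file manually.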

Step 3: Join the New Node to the K3s Cluster¶

Get the K3s server token and address. On your K3s management server node, print the node token, then list the nodes to confirm the server's name and internal IP address:

sudo cat /var/lib/rancher/k3s/server/node-token
kubectl get nodes -o wide

On the new node instance, run the following command, replacing https://mvai-mgmt:6443 with your K3s server address and <K3S_TOKEN> with the node token:

curl -sfL https://get.k3s.io | K3S_URL="https://mvai-mgmt:6443" K3S_TOKEN="<K3S_TOKEN>" sh -

Check Node Status. On your K3s management server, verify that the new node has joined the cluster and is in Ready state:

kubectl get nodes

Example:

$ kubectl get nodes
NAME           STATUS   ROLES                  AGE   VERSION
mvai-mgmt      Ready    control-plane,master   23h   v1.31.6+k3s1
mvai-nvgpu01   Ready    <none>                 23h   v1.31.6+k3s1
mvai-nvgpu02   Ready    <none>                 22s   v1.31.6+k3s1
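
When scripting this check, the status of a single node can be extracted from the listing. The sketch below assumes the column layout shown in the example output; node_status is a hypothetical helper name:

```shell
# node_status NAME: read `kubectl get nodes` output on stdin and print the
# STATUS column for the node called NAME (prints nothing if it is absent).
node_status() {
    awk -v n="$1" '$1 == n { print $2 }'
}

# Example, run on the management server:
# kubectl get nodes --no-headers | node_status mvai-nvgpu02
```

A built-in alternative is kubectl wait --for=condition=Ready node/mvai-nvgpu02 --timeout=120s, which blocks until the node reports Ready or the timeout expires.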

Mounting the EFS Volume on Management and GPU Worker Nodes¶

Install NFS Utilities

On Ubuntu 22.04, the EFS mount requires nfs-common:
```
sudo apt update && sudo apt install -y nfs-common
```
Create a Mount Directory

Create a local mount point (e.g., /mnt/efs) on each node:
```
sudo mkdir -p /mnt/efs
```
Determine the EFS Mount Endpoint

3.1. Using EFS DNS Name
By default, Amazon EFS provides a DNS name in the format:
```
<filesystem-id>.efs.<region>.amazonaws.com
```
For instance, if your $FILE_SYSTEM_ID is fs-06089fdf3a7751a5f and your $REGION is us-east-2, the EFS endpoint would be:
```
fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com
```
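
The format above can be assembled from the values already defined in EnvVars. A minimal sketch, where efs_dns_name is a hypothetical helper name:

```shell
# efs_dns_name FILE_SYSTEM_ID REGION: print the EFS DNS name in the
# <filesystem-id>.efs.<region>.amazonaws.com format.
efs_dns_name() {
    printf '%s.efs.%s.amazonaws.com\n' "$1" "$2"
}

# Example:
# EFS_DNSNAME=$(efs_dns_name "$FILE_SYSTEM_ID" "$REGION")
```
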
3.2. Optional: Using the Mount Target IP
The output from creating the EFS mount target includes its IpAddress (for example, 10.0.0.190). You can mount using that IP directly, but it’s generally better to rely on the DNS name, which resolves to the mount target in the connecting instance's Availability Zone, so the same command works on every node.

Mount the EFS File System

Use the following example command to mount EFS on each node, substituting the EFS DNS name determined in the previous step:
```
sudo mount -t nfs4 -o nfsvers=4.1 \
   fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com:/ \
   /mnt/efs
```
Replace:
- fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com with your actual EFS DNS endpoint.
- /mnt/efs with the directory you wish to mount on, if different.

Tip: Confirm the mount is successful:
```
df -hT | grep efs
```
You should see an entry similar to:

fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com:/ nfs4 … /mnt/efs

Persist the Mount in /etc/fstab

To ensure the EFS file system automatically remounts after reboot or instance stop/start, add an /etc/fstab entry on each node:
```
echo "fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com:/ /mnt/efs nfs4 defaults,_netdev 0 0" | sudo tee -a /etc/fstab
```
- _netdev marks the mount as network-dependent, so the system waits for networking before attempting it at boot.
- You can add additional options (e.g., rsize=1048576, wsize=1048576) if needed, but the above defaults typically suffice.

Once added, test the fstab entry by unmounting and remounting:
```
sudo umount /mnt/efs
sudo mount -a
```

If successful, EFS should remount without errors. Use df -hT | grep efs to confirm the file system is mounted.
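
Note that echo ... | sudo tee -a appends a new line every time it runs, so repeating the step leaves duplicate entries in /etc/fstab. As a minimal sketch, the hypothetical helpers below build the same entry used in this guide and append it only when nothing is already configured for the mount point:

```shell
# fstab_entry EFS_DNS MOUNT_POINT: print the fstab line used in this guide.
fstab_entry() {
    printf '%s:/ %s nfs4 defaults,_netdev 0 0\n' "$1" "$2"
}

# ensure_fstab_entry FILE EFS_DNS MOUNT_POINT: append the line to FILE only
# if no non-comment line already uses MOUNT_POINT as its mount point.
ensure_fstab_entry() {
    if awk -v m="$3" '
            $1 !~ /^#/ && $2 == m { found = 1 }
            END { exit !found }
        ' "$1"; then
        return 0
    fi
    fstab_entry "$2" "$3" >> "$1"
}

# Example, run as root on each node:
# ensure_fstab_entry /etc/fstab fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com /mnt/efs
```

Try it against a copy of /etc/fstab first, since a malformed entry can prevent a clean boot.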

Verify the Node is Available in MemVerge.ai¶

Log in to the MemVerge.ai Management UI Console and verify that the new node appears in the nodes list at https://mvai-mgmt/dashboard/nodes.

Node List

Summary¶

Congratulations! You have successfully added a new node to your K3s cluster, and it should now appear in the MemVerge.ai UI.