Creating the AWS EC2 Environment - A Step-by-Step Guide

Below is a step-by-step guide for creating an AWS environment suitable for hosting a K3s cluster (or other applications). These steps focus on AWS infrastructure only: VPC, subnet, security group, SSH key pair, EC2 instances (with Elastic IP addresses), and an EFS file system.

Note

  1. All commands below assume you are running in the AWS Cloud Shell with us-east-2 as the desired region. You can easily change the default region.
  2. For simplicity, we’ll create only one public subnet. You can expand to more subnets (private, HA, etc.) if needed.
  3. We’ll assign Elastic IP (EIP) addresses to each EC2 instance so their public IPs persist across stops/restarts.
  4. If you wish to add multiple control-plane/management nodes in the future for High Availability, you may need to incorporate a load balancer and point your domain’s DNS A record to that load balancer address rather than a single node’s IP.

1. Set Environment Variables

Open AWS Cloud Shell and set a few variables to make commands easier to run. Adjust the region, CIDR blocks, and other information if desired.

export REGION="us-east-2"
export CIDR_VPC="10.0.0.0/24"
export CIDR_SUBNET="10.0.0.0/24"
export SSH_KEY_NAME="MVAI-SSH-Key"
export SG_NAME="MVAIsg"
export VPC_NAME="MemVergeAI-VPC"
export SUBNET_NAME="MemVergeAI-Subnet"
export RT_NAME="MemVergeAI-RouteTable"
export IG_NAME="MemVergeAI-IGW"
export FILE_SYSTEM_NAME="MemVergeAI-EFS"

Explanation of Variables:

  • REGION="us-east-2"
  • The AWS region where all resources (VPC, EC2 instances, EFS, etc.) will be created.
  • This guide assumes Ohio (us-east-2).
  • CIDR_VPC="10.0.0.0/24"
  • The IP address range for your new VPC.
  • A /16 block supports up to 65,536 IP addresses (enough for production use cases).
  • A /24 block supports up to 256 IP addresses (~251 usable after AWS overhead), which is enough for most small PoC use cases.
  • Adjust this if you need a larger or smaller network.
  • In most cases, there is no extra cost for having a larger VPC CIDR block than you actually use.
  • CIDR_SUBNET="10.0.0.0/24"
  • The IP address range for a subnet within your VPC.
  • A /24 block supports up to 256 IP addresses; AWS reserves 5 per subnet, leaving ~251 usable addresses.
  • This subnet will be configured as a public subnet.
  • SSH_KEY_NAME="MVAI-SSH-Key"
  • The name of the SSH key pair you will create and use to securely connect to your EC2 instances.
  • SG_NAME="MVAIsg"
  • The name of the AWS Security Group you will create to manage inbound/outbound traffic rules.
  • This guide opens ports for SSH, HTTP, HTTPS, and any additional cluster ports.
  • VPC_NAME="MemVergeAI-VPC"
  • A human-readable name for the VPC resource.
  • Helps you identify this particular VPC in the AWS Console.
  • SUBNET_NAME="MemVergeAI-Subnet"
  • A name tag for your subnet resource.
  • Useful for distinguishing it from other subnets in the same region.
  • RT_NAME="MemVergeAI-RouteTable"
  • The name for the Route Table associated with the above subnet.
  • This table will define routes (e.g., a default route to an Internet Gateway).
  • IG_NAME="MemVergeAI-IGW"
  • A name tag for the Internet Gateway resource.
  • Internet Gateways provide outbound public Internet connectivity for public subnets in your VPC.
  • FILE_SYSTEM_NAME="MemVergeAI-EFS"
  • A name tag for your Amazon EFS file system.
  • By default, EFS scales automatically but can be labeled with a custom name for ease of identification.
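
Before moving on, you can quickly confirm that every variable is set. The loop below is a minimal sketch that uses bash indirect expansion to print each variable name and value:

# Print each variable so anything left unset is easy to spot
for v in REGION CIDR_VPC CIDR_SUBNET SSH_KEY_NAME SG_NAME VPC_NAME SUBNET_NAME RT_NAME IG_NAME FILE_SYSTEM_NAME; do
    printf '%s=%s\n' "$v" "${!v}"
done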

You can confirm your default region with:

aws configure get region

If this command returns no output, your region is likely defined by the $AWS_REGION or $AWS_DEFAULT_REGION environment variable. To verify your region setting, you can use the following commands:

echo $AWS_REGION
echo $AWS_DEFAULT_REGION

For more information, use aws configure list.

To explicitly set a region, use aws configure set region <your_region>.
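
For example, to pin the CLI to the region used throughout this guide and confirm the change took effect:

aws configure set region $REGION
aws configure get region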


2. Create a New VPC

A Virtual Private Cloud (VPC) provides network isolation for your cluster. Creating a dedicated VPC helps avoid conflicts with existing infrastructure and gives you full control over subnets, routing, and security.

  1. Create the VPC:

    VPC_ID=$(aws ec2 create-vpc \
       --cidr-block $CIDR_VPC \
       --tag-specifications "ResourceType=vpc,Tags=[{Key=Name,Value=$VPC_NAME}]" \
       --query 'Vpc.VpcId' \
       --output text)
    
    echo "Created VPC with ID: $VPC_ID"
    

    Example:

    echo "Created VPC with ID: $VPC_ID"
    Created VPC with ID: vpc-01bdeafcc0ce883e5
    
  2. (Optional) Enable DNS support in the VPC if it is not already enabled by default:

    To check whether DNS support and DNS hostnames are already enabled on your VPC, you can use the following AWS CLI commands in Cloud Shell. These will display each attribute for the VPC in question:

    # Replace $VPC_ID with your actual VPC ID if it’s not stored in a variable
    
    # 1. Check DNS support
    aws ec2 describe-vpc-attribute \
    --vpc-id $VPC_ID \
    --attribute enableDnsSupport
    
    # 2. Check DNS hostnames
    aws ec2 describe-vpc-attribute \
    --vpc-id $VPC_ID \
    --attribute enableDnsHostnames
    

    Example Output:

    {
      "VpcId": "vpc-123abc",
      "EnableDnsSupport": {
         "Value": true
      }
    }
    
    {
      "VpcId": "vpc-123abc",
      "EnableDnsHostnames": {
         "Value": false
      }
    }
    
    • If either Value is false, you can enable it by running:
    aws ec2 modify-vpc-attribute --vpc-id $VPC_ID --enable-dns-support "{\"Value\":true}"
    aws ec2 modify-vpc-attribute --vpc-id $VPC_ID --enable-dns-hostnames "{\"Value\":true}"
    

3. Create a Public Subnet

A subnet designates a range of IP addresses within the VPC. A public subnet has a route to the Internet, which is required for downloading packages and patches and for handling any other external traffic.

Create one subnet in the new VPC:

SUBNET_ID=$(aws ec2 create-subnet \
    --vpc-id $VPC_ID \
    --cidr-block $CIDR_SUBNET \
    --availability-zone "${REGION}a" \
    --tag-specifications "ResourceType=subnet,Tags=[{Key=Name,Value=$SUBNET_NAME}]" \
    --query 'Subnet.SubnetId' \
    --output text)

echo "Created Subnet with ID: $SUBNET_ID"

Example output:

Created Subnet with ID: subnet-01f24fb72235228ed

4. Create and Attach an Internet Gateway

An Internet Gateway (IGW) enables your VPC to communicate with the Internet. By attaching an IGW, traffic can flow from your subnet to the Internet (and vice versa), allowing downloads, updates, and inbound connections.

  1. Create the Internet Gateway (IGW):

    IGW_ID=$(aws ec2 create-internet-gateway \
       --tag-specifications "ResourceType=internet-gateway,Tags=[{Key=Name,Value=$IG_NAME}]" \
       --query 'InternetGateway.InternetGatewayId' \
       --output text)
    
    echo "Created Internet Gateway with ID: $IGW_ID"
    

    Example output:

    Created Internet Gateway with ID: igw-06ffc82ccff0bf75f
    
  2. Attach the IGW to your VPC:

    aws ec2 attach-internet-gateway \
       --internet-gateway-id $IGW_ID \
       --vpc-id $VPC_ID
    

5. Create and Associate a Routing Table

A routing table contains rules that determine where traffic is directed. By creating a route that sends 0.0.0.0/0 (i.e., Internet-bound traffic) to the IGW, you ensure that resources in your subnet can reach the outside world.

  1. Create the route table:

    RT_ID=$(aws ec2 create-route-table \
       --vpc-id $VPC_ID \
       --tag-specifications "ResourceType=route-table,Tags=[{Key=Name,Value=$RT_NAME}]" \
       --query 'RouteTable.RouteTableId' \
       --output text)
    
    echo "Created Route Table with ID: $RT_ID"
    

    Example output:

    Created Route Table with ID: rtb-05e6d9dcaf649f7ff
    
  2. Create a default route that sends Internet-bound traffic to the IGW:

    aws ec2 create-route \
       --route-table-id $RT_ID \
       --destination-cidr-block 0.0.0.0/0 \
       --gateway-id $IGW_ID
    

    Expected Output:

    {
      "Return": true
    }
    
  3. Associate the subnet with the route table (making it a public subnet):

    aws ec2 associate-route-table \
       --route-table-id $RT_ID \
       --subnet-id $SUBNET_ID
    

    Example output:

    {
       "AssociationId": "rtbassoc-066be7655ad3c78e4",
       "AssociationState": {
          "State": "associated"
       }
    }
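
To double-check the routing setup, you can list the routes on the new table. You should see the automatic local VPC route plus the 0.0.0.0/0 route pointing at the Internet Gateway:

aws ec2 describe-route-tables \
    --route-table-ids $RT_ID \
    --query 'RouteTables[0].Routes' \
    --output table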
    

6. Create a Security Group

A security group acts as a virtual firewall. By defining inbound and outbound rules, you control access to your instances. This guide opens the ports required for SSH, HTTP, HTTPS, EFS, and the internal ports needed by K3s, Grafana, and Prometheus.

Next, create a security group that allows:

  • SSH (22) from anywhere (you can restrict to specific IPs for better security).
  • HTTP (80) from anywhere.
  • HTTPS (443) from anywhere.
  • NFS (2049) for EFS mounting (from within the same VPC).
  • Internal cluster traffic (e.g., K3s on port 6443, etc.) from within the same security group.

Create the security group

SG_ID=$(aws ec2 create-security-group \
   --group-name $SG_NAME \
   --vpc-id $VPC_ID \
   --query 'GroupId' \
   --output text)

echo "Created Security Group with ID: $SG_ID"

Example output:

Created Security Group with ID: sg-00dbfae93e065b028

Authorize inbound rules

  • SSH (22) from anywhere:
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 22 \
   --cidr 0.0.0.0/0
  • HTTP (80) from anywhere:
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 80 \
   --cidr 0.0.0.0/0
  • HTTPS (443) from anywhere:
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 443 \
   --cidr 0.0.0.0/0
  • NFS/EFS (2049) and internal cluster communication on the ports that K3s, Grafana, and Prometheus need (e.g., 6443, 3000, 9090, 10250, 10257, 10259, 2379-2380, 30000-32767, and 53):
# NFS/EFS: allow from the same VPC CIDR (or the same security group)
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 2049 \
   --cidr $CIDR_VPC

# Kubernetes on 6443 (K3s/K8s API):
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 6443 \
   --source-group $SG_ID

# Grafana on 3000:
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 3000 \
   --source-group $SG_ID

# Prometheus on 9090:
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 9090 \
   --source-group $SG_ID

# Kube Controller Manager on 10257:
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 10257 \
   --source-group $SG_ID

# Kubelet API on 10250:
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 10250 \
   --source-group $SG_ID

# Kube Scheduler on 10259:
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 10259 \
   --source-group $SG_ID

# Etcd on Port Range 2379-2380
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 2379-2380 \
   --source-group $SG_ID

# NodePort on Port Range 30000-32767
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 30000-32767 \
   --source-group $SG_ID

# ICMP Echo/Ping
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol icmp \
   --port -1 \
   --cidr $CIDR_VPC

# DNS on port 53 (CoreDNS) for TCP and UDP protocols:
aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol tcp \
   --port 53 \
   --source-group $SG_ID

aws ec2 authorize-security-group-ingress \
   --group-id $SG_ID \
   --protocol udp \
   --port 53 \
   --source-group $SG_ID
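
To review the rules you just added, you can list the security group's inbound permissions:

aws ec2 describe-security-groups \
   --group-ids $SG_ID \
   --query 'SecurityGroups[0].IpPermissions[].{Protocol:IpProtocol,From:FromPort,To:ToPort}' \
   --output table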

7. Create a New SSH Key Pair

You need a secure way to log into the instances. Creating a dedicated SSH key pair ensures unique credentials and minimizes the risk of unauthorized access.

Generate a new key pair to SSH into your instances:

# Create the .pem SSH key pair file
aws ec2 create-key-pair \
   --key-name $SSH_KEY_NAME \
   --query 'KeyMaterial' \
   --output text > ${SSH_KEY_NAME}.pem

# Restrict permissions on the .pem file
chmod 400 ${SSH_KEY_NAME}.pem

# Confirm the full .pem file name
echo ${SSH_KEY_NAME}.pem

# Show the full path to the .pem file
ls -d $PWD/${SSH_KEY_NAME}.pem
  • This command creates a file named MVAI-SSH-Key.pem in Cloud Shell.
  • IMPORTANT: Keep this .pem file secure. You will use it to SSH into the EC2 instances.
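
You can also confirm that the key pair is registered with EC2; the command prints the key name and fingerprint:

aws ec2 describe-key-pairs --key-names $SSH_KEY_NAME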

7.1 Download the SSH Key Pair (.pem file)

The SSH Key Pair file (e.g., MVAI-SSH-Key.pem) currently resides on the CloudShell virtual machine. Here’s how you can download that .pem file to your local system:

  1. Open AWS CloudShell in your web browser.
  2. Confirm the key file exists by running:

    ls -l *.pem
    

    You should see your .pem file in the current directory (e.g., MVAI-SSH-Key.pem).

  3. Use CloudShell’s Download Option:

    • In the CloudShell console, click on the three-dots menu or “Actions” in the top-right corner.
    • Select “Download file” (the exact wording may vary slightly).
    • When prompted, enter the full path to your .pem file. For example, if you see it in your home directory, you might type:
    /home/cloudshell-user/MVAI-SSH-Key.pem
    
    • Choose where to save the file on your local computer.
  4. Set Permissions Locally (Recommended)
    Once the file is on your local laptop:

    • On macOS/Linux:
    chmod 400 /path/to/MVAI-SSH-Key.pem
    
    • On Windows (OpenSSH in PowerShell) you can similarly set file permissions or use Windows file properties to restrict access.
  5. Verify by listing the file or attempting to connect to your EC2 instance:

    ssh -i /path/to/MVAI-SSH-Key.pem ubuntu@<EC2-Public-IP-or-DNS>
    

8. Launch EC2 Instances

8.1 Launch the Management/Control Plane Node

An Amazon Machine Image (AMI) contains the operating system and any pre-configured software. We recommend Ubuntu 22.04 LTS.

  1. Find an Ubuntu 22.04 AMI in us-east-2. You can do:

    AMI_ID=$(aws ec2 describe-images \
      --owners 099720109477 \
      --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" \
      --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
      --output text)
    
    echo "Using Ubuntu 22.04 AMI: $AMI_ID"
    
  2. Verify the Root Device Name. Different AMIs sometimes use different root device names (e.g., /dev/sda1, /dev/xvda), so confirm the correct root device for your AMI. You can run:

    aws ec2 describe-images \
       --image-ids $AMI_ID \
       --query 'Images[0].RootDeviceName' \
       --output text
    

    Example:

    /dev/sda1
    

    This confirms the Ubuntu 22.04 image uses /dev/sda1 for the OS boot disk name.

  3. Launch the m5.xlarge instance

    The management (control plane) node runs K3s server components and orchestrates the cluster. We recommend using an m5.xlarge, or larger, instance type for balanced CPU and memory resources. This command provisions the host and assigns a 60GiB GP3 OS boot disk:

    MGMT_INSTANCE_ID=$(aws ec2 run-instances \
       --image-id $AMI_ID \
       --instance-type m5.xlarge \
       --key-name $SSH_KEY_NAME \
       --security-group-ids $SG_ID \
       --subnet-id $SUBNET_ID \
       --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=MemVergeAI-Management01}]" \
       --block-device-mappings "[
          {
             \"DeviceName\": \"/dev/sda1\",
             \"Ebs\": {
                \"VolumeSize\": 60,
                \"VolumeType\": \"gp3\"
             }
          }
       ]" \
       --query 'Instances[0].InstanceId' \
       --output text)
    
    echo "Management instance ID: $MGMT_INSTANCE_ID"
    

    Example

    Management instance ID: i-0cb85e6f74ee2b763
    

8.2 Launch the GPU Worker Node

Launch the g5.2xlarge instance:

If your applications rely on GPU compute, a g5.2xlarge instance provides an NVIDIA A10G GPU. It can handle ML workloads and other GPU-accelerated tasks.

WORKER_INSTANCE_ID=$(aws ec2 run-instances \
   --image-id $AMI_ID \
   --instance-type g5.2xlarge \
   --key-name $SSH_KEY_NAME \
   --security-group-ids $SG_ID \
   --subnet-id $SUBNET_ID \
   --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=MemVergeAI-GPU-Worker01}]' \
   --block-device-mappings "[
       {
          \"DeviceName\": \"/dev/sda1\",
          \"Ebs\": {
             \"VolumeSize\": 60,
             \"VolumeType\": \"gp3\"
          }
       }
    ]" \
   --query 'Instances[0].InstanceId' \
   --output text)

echo "GPU worker instance ID: $WORKER_INSTANCE_ID"

Example

GPU worker instance ID: i-073f4d0198334eeac

8.3 Wait for the Instances to Start

Wait a few moments for these instances to transition to a running state. Use the following command to check and monitor their status:

aws ec2 describe-instances \
    --filters "Name=vpc-id,Values=$VPC_ID" \
    --query "Reservations[].Instances[].{ID:InstanceId,Name:Tags[?Key=='Name']|[0].Value,State:State.Name,PublicIP:PublicIpAddress,PrivateIP:PrivateIpAddress}" \
    --output table

Example:

----------------------------------------------------------------------------------------------
|                                      DescribeInstances                                     |
+----------------------+---------------------------+-------------+----------------+----------+
|          ID          |           Name            |  PrivateIP  |   PublicIP     |  State   |
+----------------------+---------------------------+-------------+----------------+----------+
|  i-0770b293b7b6383e0 |  MemVergeAI-Management01  |  10.0.0.156 |  None          |  running |
|  i-02a83e5064fccd806 |  MemVergeAI-GPU-Worker01  |  10.0.0.9   |  None          |  running |
+----------------------+---------------------------+-------------+----------------+----------+

Continue once all instances are in the running state.
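
Alternatively, instead of polling the table above, you can block until both instances report running by using the built-in EC2 waiter:

aws ec2 wait instance-running \
    --instance-ids $MGMT_INSTANCE_ID $WORKER_INSTANCE_ID

echo "Both instances are running"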


9. Allocate and Assign Elastic IPs (Optional)

Depending on the subnet settings, AWS either assigns instances a dynamic public IP at launch or none at all (as in this guide, where the subnet does not auto-assign public IPs). Elastic IPs (EIPs) remain the same even if you stop and start the instance, ensuring consistent DNS mappings and preventing certificate issues with Let’s Encrypt.

Because we want static public IPs that persist through instance stop/start cycles, allocate an Elastic IP for each instance.

  1. Management Node EIP:

    MGMT_EIP_ALLOC_ID=$(aws ec2 allocate-address --query 'AllocationId' --output text)
    echo "Management EIP Allocation ID: $MGMT_EIP_ALLOC_ID"
    
    aws ec2 associate-address \
      --instance-id $MGMT_INSTANCE_ID \
      --allocation-id $MGMT_EIP_ALLOC_ID
    

    Example Output

    Management EIP Allocation ID: eipalloc-0cc79d97965e5986b
    {
      "AssociationId": "eipassoc-099410eb11afe9ca2"
    }
    
  2. You can retrieve the public IP by:

    MGMT_PUBLIC_IP=$(aws ec2 describe-addresses \
     --allocation-ids $MGMT_EIP_ALLOC_ID \
     --query 'Addresses[0].PublicIp' \
     --output text)
     echo "Management Node Public IP: $MGMT_PUBLIC_IP"
    

    Example:

    Management Node Public IP: 3.128.242.144
    
  3. Worker Node EIP:

    WORKER_EIP_ALLOC_ID=$(aws ec2 allocate-address --query 'AllocationId' --output text)
    echo "Worker EIP Allocation ID: $WORKER_EIP_ALLOC_ID"
    
    aws ec2 associate-address \
      --instance-id $WORKER_INSTANCE_ID \
      --allocation-id $WORKER_EIP_ALLOC_ID
    

    Example output:

    Worker EIP Allocation ID: eipalloc-0ce1a7fdc0990679f
    {
     "AssociationId": "eipassoc-0363ddecb1afe643c"
    }
    
    • Retrieve the public IP:

      WORKER_PUBLIC_IP=$(aws ec2 describe-addresses \
       --allocation-ids $WORKER_EIP_ALLOC_ID \
       --query 'Addresses[0].PublicIp' \
       --output text)
      echo "Worker Node Public IP: $WORKER_PUBLIC_IP"
      

      Example:

      Worker Node Public IP: 3.20.19.118
      

Now both instances have static public IPs that remain consistent across reboots.

Here is another way to confirm both EC2 instances now have a 'PublicIP':

aws ec2 describe-instances \
    --filters "Name=vpc-id,Values=$VPC_ID" \
    --query "Reservations[].Instances[].{ID:InstanceId,Name:Tags[?Key=='Name']|[0].Value,State:State.Name,PublicIP:PublicIpAddress,PrivateIP:PrivateIpAddress}" \
    --output table

Example:

----------------------------------------------------------------------------------------------
|                                      DescribeInstances                                     |
+----------------------+---------------------------+-------------+----------------+----------+
|          ID          |           Name            |  PrivateIP  |   PublicIP     |  State   |
+----------------------+---------------------------+-------------+----------------+----------+
|  i-0770b293b7b6383e0 |  MemVergeAI-Management01  |  10.0.0.156 |  3.20.192.186  |  running |
|  i-02a83e5064fccd806 |  MemVergeAI-GPU-Worker01  |  10.0.0.9   |  3.128.242.144 |  running |
+----------------------+---------------------------+-------------+----------------+----------+

10. Renaming AWS EC2 Hostnames (Optional)

The default hostnames created by AWS are not intuitive for the MemVerge.ai cluster. You can rename your AWS EC2 instances to more intuitive hostnames like mvai-mgmt and mvai-nvgpu01, which makes the cluster easier to manage.

  1. Update the hostname on each instance

    SSH into each EC2 instance and run the following commands:

    sudo hostnamectl set-hostname new-hostname 
    

    Replace "new-hostname" with your desired hostname (e.g., mvai-mgmt, mvai-nvgpu01).

  2. Update /etc/hosts file

    Edit the /etc/hosts file and add a line with the new hostname below the default 127.0.0.1 localhost line:

    127.0.0.1 new-hostname
    
  3. Update DNS settings (Optional)

    If you're using Amazon Route 53 or another DNS service, update the DNS records to reflect the new hostnames.

  4. Reboot the host:

    sudo systemctl reboot
    
  5. When the system boots, verify the new hostname is correct:

    hostnamectl
    

10.1 Updating /etc/hosts on All Nodes

To ensure proper communication between nodes in your cluster, you must add the hostnames and IP addresses of all nodes to the /etc/hosts file on each system. This step is crucial when not using DNS for hostname resolution. If you use DNS, this step is not required. Ensure your DNS entries are correct.

  1. Gather the private IP addresses and hostnames of all nodes in your cluster using ip a.

  2. SSH into each node (management and worker nodes). The default user for Ubuntu Linux is ubuntu:

    ssh ubuntu@<node-ip>
    
  3. On each node, edit the /etc/hosts file:

    sudo vim /etc/hosts
    
  4. Add entries for all nodes in your cluster. The format is:

    <private-ip> <hostname>
    

    For example, add these lines:

    # MemVerge.ai Cluster IP Addresses and Hostnames
     10.0.0.156 mvai-mgmt
     10.0.0.9 mvai-nvgpu01
    

    Add an entry for each node in your cluster, including the node you're currently editing.

  5. Save the file and exit the editor.

  6. Repeat steps 2-5 for each node in your cluster.

  7. Verify the changes by pinging other nodes using their hostnames:

    ping mvai-mgmt
    ping mvai-nvgpu01
    

    Ensure that each node can ping all other nodes using their hostnames.

    By adding these entries to /etc/hosts on all systems, you ensure that each node can resolve the hostnames of other nodes in the cluster. This is crucial for Kubernetes and other cluster components to communicate properly.

    Remember to update the /etc/hosts file on all nodes whenever you add or remove nodes from your cluster. While this manual process works well for smaller, static clusters, using DNS is generally preferred for larger or more dynamic environments.
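
If you prefer a non-interactive approach, the sketch below appends the same entries with a heredoc. The IP addresses and hostnames are the example values from step 4; replace them with your own before running it on each node:

# Append the cluster entries to /etc/hosts (adjust IPs/hostnames to your environment)
cat <<'EOF' | sudo tee -a /etc/hosts
# MemVerge.ai Cluster IP Addresses and Hostnames
10.0.0.156 mvai-mgmt
10.0.0.9   mvai-nvgpu01
EOF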


11. Create an EFS File System

To have a persistent shared file system that can be mounted by all cluster nodes, create a new AWS Elastic File System (EFS). Amazon EFS is a scalable, elastic file system that multiple instances can access simultaneously. You can use it for shared storage among your cluster nodes to store snapshots and other data that needs to be accessible by the user workloads.

  1. Create the file system:

    FILE_SYSTEM_ID=$(aws efs create-file-system \
       --performance-mode generalPurpose \
       --throughput-mode bursting \
       --encrypted \
       --tags Key=Name,Value=$FILE_SYSTEM_NAME \
       --query 'FileSystemId' \
       --output text)
    
    echo "Created EFS with ID: $FILE_SYSTEM_ID"
    

    Example:

    Created EFS with ID: fs-06089fdf3a7751a5f
    
  2. Create a Mount Target in the same subnet:

    aws efs create-mount-target \
       --file-system-id $FILE_SYSTEM_ID \
       --subnet-id $SUBNET_ID \
       --security-groups $SG_ID
    

    Example output:

    {
        "OwnerId": "669102733081",
        "MountTargetId": "fsmt-06bfc187a478c98e4",
        "FileSystemId": "fs-06089fdf3a7751a5f",
        "SubnetId": "subnet-01f24fb72235228ed",
        "LifeCycleState": "creating",
        "IpAddress": "10.0.0.190",
        "NetworkInterfaceId": "eni-098e4d7ff9692af99",
        "AvailabilityZoneId": "use2-az1",
        "AvailabilityZoneName": "us-east-2a",
        "VpcId": "vpc-01bdeafcc0ce883e5"
    }
    
  3. Obtain the DNS Name of the EFS File system. This is required by the Management and Worker nodes to mount it later.

EFS_DNSNAME="${FILE_SYSTEM_ID}.efs.${REGION}.amazonaws.com"
echo "The EFS DNS Name is: $EFS_DNSNAME"

Example:

The EFS DNS Name is: fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com
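
The mount target created above starts in the creating state. Before mounting the file system from your instances, you can confirm it has become available:

aws efs describe-mount-targets \
    --file-system-id $FILE_SYSTEM_ID \
    --query 'MountTargets[0].LifeCycleState' \
    --output text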

12. Mounting the EFS Volume on Management and GPU Worker Nodes

After creating your EFS file system and mount target, you can mount it on both instances (the management node and the GPU worker node) so they share the same persistent storage. You will need to SSH to each Management and Worker node to perform these actions.

NOTE

If you are using the web-based CloudShell environment, click '+' to open a new terminal tab for the SSH sessions. This avoids losing the shell environment variables created during the installation so far, which you will need in later steps.

  1. Install NFS Utilities

    On Ubuntu 22.04, the EFS mount requires nfs-common:

    sudo apt update && sudo apt install -y nfs-common
    
  2. Create a Mount Directory

    Create a local mount point (e.g., /mnt/efs) on each node:

    sudo mkdir -p /mnt/efs
    
  3. Determine the EFS Mount Endpoint

    3.1. Using EFS DNS Name
    By default, Amazon EFS provides a DNS name in the format:

    <filesystem-id>.efs.<region>.amazonaws.com
    

    For instance, if your $FILE_SYSTEM_ID is fs-06089fdf3a7751a5f and your $REGION is us-east-2, the EFS endpoint would be:

    fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com
    

    3.2. Optional: Using the Mount Target IP
    As shown in your creation output, the IpAddress might be 10.0.0.190. You can mount using that IP directly, but it’s generally better to rely on the DNS name for high availability and automatic failover between Availability Zones.

  4. Mount the EFS File System

    Use the following command example to mount EFS on each node. Replace the DNS with the one displayed in the previous step:

    sudo mount -t nfs4 -o nfsvers=4.1 \
       fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com:/ \
       /mnt/efs
    

    Replace:

    • fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com with your actual EFS DNS endpoint.
    • /mnt/efs with the directory you wish to mount on, if different.

    Tip: Confirm the mount is successful:

    df -hT | grep efs
    

    You should see an entry similar to:
    fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com:/ nfs4 … /mnt/efs

  5. Persist the Mount in /etc/fstab

    To ensure the EFS file system automatically remounts after reboot or instance stop/start, add an /etc/fstab entry on each node:

    echo "fs-06089fdf3a7751a5f.efs.us-east-2.amazonaws.com:/ /mnt/efs nfs4 defaults,_netdev 0 0" | sudo tee -a /etc/fstab
    
    • _netdev ensures the system knows this mount requires a network connection before mounting.
    • You can add additional options (e.g., rsize=1048576, wsize=1048576) if needed, but the above defaults typically suffice.

    Once added, test the fstab entry by unmounting and remounting:

    sudo umount /mnt/efs
    sudo mount -a
    

If successful, EFS should remount without errors. Use df -hT | grep efs to confirm the file system is mounted.

Remember to repeat this process on all Management and Worker nodes before proceeding!
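
As a quick end-to-end check of the shared storage, write a file from one node and read it from the other (the file name below is arbitrary):

# On the management node
echo "hello from $(hostname)" | sudo tee /mnt/efs/efs-test.txt

# On the GPU worker node
cat /mnt/efs/efs-test.txt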


13. (Optional) Create a Load Balancer for the Management Node(s)

When running multiple control-plane (management) nodes, you want a single, stable endpoint for client or API access. An AWS Network Load Balancer (NLB) is well-suited for load-balancing TCP traffic—such as the Kubernetes API on port 6443. Alternatively, if you plan to expose HTTP/HTTPS services directly from the control plane, you might prefer an Application Load Balancer (ALB). The instructions below use an NLB for simplicity.

13.1 Create a Network Load Balancer for the Kubernetes API

If you do not require a highly available control/management plane setup, skip this step and proceed to Step 14.

A Network Load Balancer operates at Layer 4 (TCP). It passes traffic quickly and efficiently to multiple backend instances (in this case, your management nodes). This setup also helps facilitate a High Availability environment by ensuring traffic is directed only to healthy nodes.

  1. Create the NLB:

    LB_ARN=$(aws elbv2 create-load-balancer \
       --name MemVergeAI-NLB \
       --type network \
       --scheme internet-facing \
       --subnets $SUBNET_ID \
       --query 'LoadBalancers[0].LoadBalancerArn' \
       --output text)
    
    echo "Created NLB with ARN: $LB_ARN"
    
    • This places the load balancer in the public subnet you created earlier ($SUBNET_ID).
    • We use internet-facing so it can receive traffic from external clients.

    Example output:

    Created NLB with ARN: arn:aws:elasticloadbalancing:us-east-2:669102733081:loadbalancer/net/MemVergeAI-NLB/38511461f2be2c06
    
  2. Create a Target Group:

    TG_ARN=$(aws elbv2 create-target-group \
       --name MemVergeAI-NLB-Targets \
       --protocol TCP \
       --port 6443 \
       --vpc-id $VPC_ID \
       --query 'TargetGroups[0].TargetGroupArn' \
       --output text)
    
    echo "Created Target Group with ARN: $TG_ARN"
    
    • Here we specify TCP on port 6443, which is the typical port for the K3s (and Kubernetes) API.
    • Adjust the port if your management node listens on a different one.

    Example output:

    Created Target Group with ARN: arn:aws:elasticloadbalancing:us-east-2:669102733081:targetgroup/MemVergeAI-NLB-Targets/a0f6cb75468d3df1
    
  3. Register Your Existing Management Node:

    aws elbv2 register-targets \
       --target-group-arn $TG_ARN \
       --targets Id=$MGMT_INSTANCE_ID,Port=6443
    
    • This tells the NLB to forward incoming traffic on port 6443 to the management node’s instance ID on the same port.
  4. Create a Listener:

    aws elbv2 create-listener \
       --load-balancer-arn $LB_ARN \
       --protocol TCP \
       --port 6443 \
       --default-actions Type=forward,TargetGroupArn=$TG_ARN
    
    • The listener watches for traffic on port 6443 on the load balancer and forwards it to your target group ($TG_ARN).

Once created, the NLB will have its own DNS name. You can retrieve it by running:

LB_DNS_NAME=$(aws elbv2 describe-load-balancers \
    --load-balancer-arns $LB_ARN \
    --query 'LoadBalancers[0].DNSName' \
    --output text)

echo "NLB DNS Name: $LB_DNS_NAME"

13.2 Update Your DNS to Point to the Load Balancer

Rather than pointing your DNS A record to a single management node's IP, you can point it to the NLB. This way, if you add or remove management nodes, the DNS stays the same, and the load balancer handles routing.

  1. Create or Update an A Record in Route53 (if using Route53):

     # Example assumes your hosted zone ID is stored in $HOSTED_ZONE_ID,
     # you want "ai.example.com" to resolve to the load balancer,
     # and the AliasTarget HostedZoneId below is the ALB/NLB hosted zone ID for your region.
    
    aws route53 change-resource-record-sets \
       --hosted-zone-id $HOSTED_ZONE_ID \
       --change-batch '{
         "Changes": [{
           "Action": "UPSERT",
           "ResourceRecordSet": {
             "Name": "ai.example.com.",
             "Type": "A",
             "AliasTarget": {
               "HostedZoneId": "Z26RNL4JYFTOTI",
               "DNSName": "'"$LB_DNS_NAME"'",
               "EvaluateTargetHealth": false
             }
           }
         }]
       }'
    
    • Note that Alias records for ALB/NLB require the correct HostedZoneId for the AWS region (e.g., us-east-2). Check AWS Documentation for the correct “ALB/NLB Hosted Zone ID.”
    • Alternatively, if you cannot use Alias records, you can create a CNAME record pointing to the NLB’s DNS name. However, an Alias record is usually recommended.
  2. Using a 3rd Party DNS Provider:

    • Go to your DNS provider’s dashboard.
    • Create or edit a record for ai.example.com that references the NLB’s DNS name.
    • If the provider doesn’t allow an ALIAS or ANAME record, you might need a CNAME that points to the NLB’s domain (e.g. xxx.elb.amazonaws.com).
    • Wait for DNS propagation before testing.

13.3 Adding Another Management Host in the Future

As your environment grows or you need High Availability, you might spin up a second or third management node. Each new node should be included in the same NLB target group so traffic to port 6443 is distributed among all management nodes.

  1. Launch another management EC2 instance as you did before (e.g., m5.xlarge in the same VPC/subnet).
  2. Register the new instance with the existing target group:

    # Suppose NEW_MGMT_INSTANCE_ID is the new management node's instance ID
    aws elbv2 register-targets \
       --target-group-arn $TG_ARN \
       --targets Id=$NEW_MGMT_INSTANCE_ID,Port=6443
    
  3. Validate health checks:

    • The NLB runs health checks against the specified port (6443 by default). Make sure the new node is healthy and able to serve K3s traffic.
    • You can check the health status via:
    aws elbv2 describe-target-health --target-group-arn $TG_ARN
    
  4. Scale as needed: You can add more management nodes the same way. The NLB automatically starts routing traffic once they pass health checks.

IMPORTANT

For a true High-Availability K3s setup, ensure your additional management nodes are configured properly at the K3s layer. Simply adding more control-plane instances behind the load balancer is only half the equation; each new node must be joined as a server in K3s (not just as a worker). Follow K3s’s official documentation for the correct HA setup procedure.


14. Handling HTTP and HTTPS with an Application Load Balancer and Traefik (for Kubernetes)

When you install K3s with the default configuration, it typically includes the Traefik ingress controller. This component listens on ports 80 and 443 inside the cluster for incoming HTTP and HTTPS connections. By default, K3s might expose Traefik via a Service of type LoadBalancer or NodePort, depending on your configuration. You will install K3s next, so we will create an HTTP/HTTPS load balancer once Kubernetes is operational. For now, no further action is needed.


15. Update Your DNS Records

If you did not create a load balancer and assign a DNS entry in Step 13, follow this step. Otherwise, continue to Step 16.

If you own a domain and want to serve HTTPS requests to your cluster without warnings, you’ll need to add a DNS A record pointing to the Elastic IP (or load balancer) of the Management Node(s). Below is an example approach:

  1. Using Route53:
    If your domain is hosted in Route53, create (or update) an A record in the relevant Hosted Zone:

    # This example assumes you have a Hosted Zone ID in the $HOSTED_ZONE_ID variable
    # and want to point 'ai.example.com' to your management node's EIP.
    
    aws route53 change-resource-record-sets \
       --hosted-zone-id $HOSTED_ZONE_ID \
       --change-batch '{
         "Changes": [{
           "Action": "UPSERT",
           "ResourceRecordSet": {
             "Name": "ai.example.com.",
             "Type": "A",
             "TTL": 300,
             "ResourceRecords": [{"Value": "'"$MGMT_PUBLIC_IP"'"}]
           }
         }]
       }'
    
  2. Using Another DNS Provider:

    • Log into your DNS provider’s dashboard.
    • Create an A Record for e.g. ai.example.com and set its value to your management node’s public IP.
    • Wait for DNS propagation (can take a few minutes up to an hour).

Once the DNS record is in place, you can connect to your management node using ai.example.com (or whichever hostname you chose).
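
To confirm propagation before connecting, you can query the record (this assumes the dig utility is available; nslookup works similarly). It should return the management node's Elastic IP:

dig +short ai.example.com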


16. Verify Connectivity

After a few minutes, you should be able to:

  1. SSH into the management node using:

    ssh -i ${SSH_KEY_NAME}.pem ubuntu@$MGMT_PUBLIC_IP
    

    or if DNS is set:

    ssh -i ${SSH_KEY_NAME}.pem ubuntu@ai.example.com
    
  2. Ping the worker node from the management node (and vice versa) using private IPs:

    # On the management node, do:
    ping <worker-private-IP>
    
    # On the worker node, do:
    ping <management-private-IP>
    

    They should communicate freely within the VPC.

  3. Access the Internet from your EC2 instances (e.g., curl https://www.google.com).


17. Update the Operating System

  • (Recommended) OS Patches: SSH into each instance and run:

    sudo apt update && sudo apt upgrade -y
    

     Do not perform a distribution release upgrade (e.g., from 22.04 to a newer release); stay on Ubuntu 22.04 LTS.
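
    If the upgrade installs a new kernel, Ubuntu leaves a marker file indicating a reboot is needed. A quick check using the standard Ubuntu reboot-required marker:

    # Reboot only if the upgrade requires it
    if [ -f /var/run/reboot-required ]; then
        sudo reboot
    fi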

Conclusion

Congratulations! Your AWS environment is now ready to deploy Kubernetes and MemVerge.ai. Proceed to the next steps in this Installation Guide to continue the installation process.