High-performance Computing Mode

High-performance Computing (HPC) Mode enables an MMCloud subscriber to use the OpCenter to create (and manage) a cluster of compute nodes that execute jobs scheduled from an input queue.

Feature Description

An HPC cluster assembles many separate servers (nodes), connected via a fast network, into a high-availability computing environment specialized for parallel processing of large data sets.

The HPC cluster is composed of three types of nodes.

  • One or more Login nodes from which users submit and manage jobs
  • One Head node (also known as the Control Node) that provides the scheduling service (for example, Slurm)
  • One or more Compute nodes which are dynamically created and destroyed in response to the workloads applied

The compute nodes in the cluster can have different capabilities.

  • A range of processing power, that is, nodes with different CPU types and quantities
  • Large memory capacity nodes (so-called fat nodes) with multiple TBs of DRAM
  • Nodes with specialized processors, such as GPUs

Two services are mandatory for the operation of the HPC cluster. These services are not instantiated as part of MMCloud's HPC Mode and must be configured separately.

  • Directory Service (such as LDAP or NIS) authenticates all workload activities (such as login, job submission, job cancellation, and so on)
  • Shared File System (such as NFS) provides shared storage so that an application can access the same directory path on every node

Optionally, you can also start an SMTP service so that users receive email updates on the status of their jobs.

The HPC cluster maintains a system of queues to schedule and manage jobs. The current implementation of MMCloud's HPC Mode uses the Slurm Workload Manager as the queuing system.

When you submit a job to the HPC cluster, the queue manager assigns the job to a queue. The scheduler then assigns the resources you request (for example, numbers of CPUs and GPUs, and memory capacity) subject to the job’s priority and the availability of resources.
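
For example, with Slurm as the queuing system, a user can state these resource requests directly on the sbatch command line. The following is a minimal sketch; the script name and resource values are illustrative only.

# Request one task with 4 CPUs, 16 GB of memory, and a one-hour time limit
$ sbatch --ntasks=1 --cpus-per-task=4 --mem=16G --time=01:00:00 my_job.sh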

Operation

The components and operation of the HPC cluster are shown in the figure.

HPC Cluster Architecture

A user of the HPC cluster always starts by logging in (using SSH) to the login node. From the login node, the user (the owner of the job) submits the job to the head node. The head node assigns the job to a queue and eventually the job runs on one or more compute nodes. The login node never runs any workloads. The head node maintains information on job status and related accounting data.

The compute nodes run the job workloads. There are no compute nodes when the cluster starts. Compute nodes are created and destroyed as needed by the workload and its resource demands.
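
Because compute nodes appear and disappear with the workload, the view from the login node changes over time. With Slurm as the scheduler, you can observe this with the standard status commands shown below (a sketch; the output depends on the current state of your cluster).

$ sinfo     # queues (partitions) and the compute nodes currently serving them
$ squeue    # jobs that are pending or running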

Configuration

To deploy an HPC Cluster using the OpCenter, complete the tasks in the following sections.

Complete Preliminary Tasks

Note

The OpCenter uses a custom virtual machine image (AMI, in AWS terminology) and a custom container image (called mm-hpc) to create HPC cluster nodes. Your MemVerge sales team must make the AMI and container image available to your CSP account in your preferred region.

Before creating an HPC cluster, you must complete the following preliminary tasks.

  • Start an OpCenter running MMCloud release 3.1.0 or later. If you upgrade from an earlier release, run the script to upgrade to the SQL database (see Upgrade to the 3.1 release).
  • Confirm that the influxdb and postgres databases are running on the OpCenter by completing the following.
    • ssh in to the OpCenter
      ssh -i "PEM_FILE" mmc@OPCENTER_IP_ADDRESS
    • Check that influxdb is running
      ps aux | grep influx
    • Check that postgres is running
      ps aux | grep postgres
  • Upload the customized AMI that cluster nodes run on.

    • Each region uses a different AMI and container image. In AWS, use the following identifiers (container images are stored as snapshots).
    Region          AMI ID                  Image Snapshot
    aws-us-east-1   ami-01e1d06675cb3b3ce   snap-0fcaa80f7ddb81990
    aws-us-east-2   ami-0c54c8a1d31434a24   snap-08d216f3e23bbe5cc
    • Create a text file, called mi-cluster.yaml, to describe the parameters related to the AMI.

      Here is an example of mi-cluster.yaml to use in us-east-1. Use it as a template and edit the parameters that differ for your region.

      $ cat mi-cluster.yaml
      # mi-cluster.yaml
      aws:
          us-east-1:
              - id: ami-01e1d06675cb3b3ce
                name: cluster
                description: Image for MMC cluster nodes
                tags:
                    feature: cluster_2.1
                    os: centos_7.9
                source: private
                cloudMachineImage:
                    arch: x86_64
                    id: ami-01e1d06675cb3b3ce
                    name: EDA-2.8
                    description: EDA user node image 2024-09-14
                    tags:
                        Name: EDA-2.8
      
    • Update the OpCenter’s AMI list by entering the following command.

      $ float mi update -f mi-cluster.yaml
      updated machine image file successfully
      
    • Check that the cluster node AMI is available on the OpCenter by entering the following command.

      $ float mi list
      aws:
          us-east-1:
              [deleted lines for other ami entries]
              - id: ami-01e1d06675cb3b3ce
              oid: mi-eggvrzt0ybejp3xhmc2g7
              name: cluster
              description: Image for MMC cluster nodes
              tags:
                  feature: cluster_2.1
                  os: centos_7.9
              source: private
              cloudMachineImage:
                  arch: x86_64
                  id: ami-01e1d06675cb3b3ce
                  name: mmc-cluster-node_centos79_hpc311
                  description: '[Copied ami-0c54c8a1d31434a24 from us-east-2] mmc-cluster-node_centos79_hpc311'
                  tags: {}
              updateTime: 2025-05-14T03:01:21.982529621Z
      
    • To start a virtual machine using the cluster node AMI, you must subscribe to "CentOS 10 Minimal Latest (Free) sold by Hanwei software technology (Hong Kong) Co., Ltd."

      • Log in to your AWS Management Console and go to the AWS Marketplace
      • In the left-hand panel, click Discover Products
      • Search for "CentOS 10 Minimal Latest (Free)"
      • From the results, choose "CentOS 10 Minimal Latest (Free)"
      • At the top, on the right-hand side, click View purchase options
      • Subscribe to this software
    • Upload the customized container image (called mm-hpc) that the cluster nodes run (the image is stored as a snapshot).

    • Identify the snapshot ID for the container image, for example, use snap-0fcaa80f7ddb81990 for aws-us-east-1.

    • Upload mm-hpc to the OpCenter's Application Library by entering the following command (change the snapshot ID for your region).

      $ float image add mm-hpc localhost/mm-hpc:1.0.17 --link snapshot://snap-0fcaa80f7ddb81990
      name: mm-hpc
      uri: localhost/mm-hpc
      owner: admin
      tags:
          1.0.17:
              status: Ready
              uri: snapshot://snap-0fcaa80f7ddb81990
              locked: false
              lastUpdated: 2025-05-14T02:49:50.375266066Z
              lastPushed: 2025-05-14T02:49:50.375266115Z
              size: 6.00 GB
      
  • [Optional, but recommended] Define the set of allowed instance types in the OpCenter configuration settings. This constrains the range of instance types allowed for cluster nodes. For example, to allow only instances in the "t" family, enter the following.

    $ float config set provider.allowList "t*"
    allow list is set to [t*]
    

Deploy Mandatory Services

Two services are mandatory for HPC Mode.

  • Directory Service (such as LDAP or NIS)
  • Shared file system (such as NFS)

How to deploy these services is beyond the scope of this guide.

Note

The OpCenter uses the semantics of POSIX-compliant systems to describe groups for authentication and authorization purposes. A posixGroup is an object class that represents a UNIX-style group. The OpenLDAP schema uses different group semantics that are not compatible. Include the nis schema in your LDAP server configuration to enable POSIX-style semantics.
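
As a quick check after including the nis schema, you can query the directory for POSIX-style groups with a standard ldapsearch command. This is a sketch only; the server address, bind DN, and base DN are placeholders taken from the example registration shown later in this section.

$ ldapsearch -x -H ldap://192.168.1.100 \
      -D "cn=Manager,dc=my-domain,dc=com" -W \
      -b "dc=my-domain,dc=com" \
      "(objectClass=posixGroup)" cn gidNumber memberUid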

[Optional] Deploy SMTP Service

If available, the HPC cluster can use SMTP (Simple Mail Transfer Protocol) to send email notifications about job status, errors, or completion. Email updates help users stay informed about their jobs and allow them to address any issues promptly. To enable SMTP service, you must deploy an SMTP server.

Note

The mandatory and optional services may already be available as part of your IT infrastructure. If not, you may have to deploy these services specifically to support HPC Mode. In this case, you must deploy these services in compliance with your organization's IT security policy, for example, whether or not to allow public IP addresses.

Register Directory Service on the OpCenter

After you have deployed an LDAP (or NIS) server, you must register your preferred directory service (for example, LDAP) on the OpCenter. The registration process is similar for LDAP and NIS. You can use the web interface or the float CLI to register. See the section on Authentication for the registration process for LDAP.

Note

The settings on your LDAP (or NIS) server determine some of the parameters you must specify when registering a directory service on the OpCenter (for example, whether to allow anonymous queries).

After you complete the registration, confirm the registration using the CLI or the web interface. For example,

bash-3.2$ float ldap list
- id: ldap-s1z91wu1
  name: test1
  createTime: 2025-05-21T01:23:02.325638051Z
  network: tcp
  addr: 192.168.1.100:389
  useTLS: false
  anonymous: false
  bindDN: cn=Manager,dc=my-domain,dc=com
  bindPW: '********'
  base: dc=my-domain,dc=com
  adminGroup: ""
  peopleOU: ou=Memverge
  groupOU: ou=Groups
  connTimeout: 0s
bash-3.2$ float nis list
No NIS config

On the web interface, log in as admin. From the left-hand panel, select SERVICE->Security, and then click the LDAP or NIS button on the Security screen to display the registered directory services.

Register NFS Filesystem on the OpCenter

After you have deployed an NFS server, you must register the exported directory as Storage on the OpCenter. You can use the web interface or the float CLI to register.

With the CLI, complete the following steps.

  • Check the following items (a command-line verification sketch follows this procedure).

    • There is network connectivity between the private IP addresses of the NFS server and the OpCenter.
    • The NFS server allows inbound access to port 2049 (apply an appropriate security group if needed).
    • The subnet mask in /etc/exports on the NFS server allows a cluster node to mount file systems from the NFS server.
  • Log in to the OpCenter as admin.

  • Use the float storage register command with the -D or --dataVolume option as follows:

    float storage register --name NAME --dataVolume nfs://NFS_PRIVATE_IP/EXPORTED_DIR:/MOUNTED_DIR

    Replace:

    • NAME: name to identify filesystem
    • NFS_PRIVATE_IP: private IP address of the NFS server
    • EXPORTED_DIR: directory (or path to directory) exported by the NFS server
    • MOUNTED_DIR: mount point (or path to mount point) where cluster node mounts the exported directory

    Example:

    float storage register --name nfsserver --dataVolume nfs://172.31.53.99/mnt/memverge/shared:/homehpc
    
  • Check that the filesystem is available by entering float storage list
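
The verification sketch referenced above uses standard NFS client tools (from the nfs-utils package) and can be run from any host that can reach the NFS server, such as the OpCenter. The IP address and paths are taken from the example; replace them with your own.

# List the directories exported by the NFS server
$ showmount -e 172.31.53.99

# Test-mount the export, confirm read-write access, then clean up
$ sudo mkdir -p /mnt/nfstest
$ sudo mount -t nfs 172.31.53.99:/mnt/memverge/shared /mnt/nfstest
$ sudo touch /mnt/nfstest/.rwtest && sudo rm /mnt/nfstest/.rwtest
$ sudo umount /mnt/nfstest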

Equivalently, open the web interface.

  • Log in as admin
  • From the left-hand panel, select Storage
  • On the Storage screen, click Register on the right-hand side
  • In the pop-up window, fill in the fields as follows:
    • Name: name to identify file system
    • Storage Type: select NFS from the drop-down menu
    • URL: NFS_PRIVATE_IP:/EXPORTED_DIR (include the colon)
    • Mount Options: leave blank unless you have specific mount requirements
    • Mount Point: /MOUNTED_DIR
    • Access Mode: use the drop-down menu to select Read and Write

Note

All cluster nodes, including the login node, have read-write access to /MOUNTED_DIR. After you ssh in to the login node, you must submit jobs from /MOUNTED_DIR. You can configure users in the LDAP directory with their home directories set to /MOUNTED_DIR so that when users ssh in to the login node, they are automatically placed in the /MOUNTED_DIR directory.
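
To confirm that a user's home directory resolves to the shared mount point, you can run a standard name-service lookup on the login node. This assumes the login node's name service is backed by the directory service (which is the case when LDAP logins work); USER is a placeholder for an account defined in the directory.

$ getent passwd USER    # the sixth colon-separated field is the user's home directory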

[Optional] Register SMTP Server on the OpCenter

The HPC cluster supports sending email notifications triggered by job events. When submitting a job, a user can specify the event types to trigger notifications to a list of recipients. Here is an example:

sbatch --mail-type=ALL --mail-user=bob@gmail.com run_job.sh

You can register an SMTP server using the float CLI or the web interface.

Note

The settings on your SMTP server determine some of the parameters you must specify when registering SMTP on the OpCenter (for example, whether to use SSL or TLS).

Use the float CLI as follows:

$ float smtp add -h
Add a new smtp config

Usage:
float smtp add [flags]

Flags:
    --from string       "sent from" email address
-n, --name string       name of the SMTP config
-P, --password string   password to authenticate the user of the SMTP server
    --passwordStdin     prompt for SMTP user password
    --port int          SMTP server port
-S, --server string     SMTP server IP address
    --useSSL            use SSL
    --useTLS            use TLS
-U, --user string       user of SMTP server
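
For example, combining the flags above into a single registration command (the name, server address, port, and email addresses are placeholders; choose --useSSL or --useTLS and the port to match your SMTP server's settings):

$ float smtp add -n mysmtp -S 172.31.10.25 --port 587 --useTLS \
      -U notifier@example.com --passwordStdin --from notifier@example.com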

Use the web interface as follows.

  • Log in to the web interface as admin
  • In the left-hand panel, click SERVICE->SMTP
  • On the SMTP screen, click Register SMTP
  • In the pop-up window, fill in the fields and then click Verify Connection

Create an HPC Cluster

Although you can use the float CLI to configure an HPC cluster, it is much easier to use the web interface.

Note

You have the option of configuring public or private IP addresses for cluster nodes. You must comply with your organization's IT security policy. For example, your IT security policy may allow public IP addresses for login nodes, but require private IP addresses only for head and compute nodes.

To create an HPC cluster using the web interface, complete the following steps.

  • Log in to the web interface as admin
  • At the bottom of the left-hand panel, there is a toggle switch to turn HPC Mode on or off. Turn HPC Mode on.
  • Log back in to the web interface as admin
  • In the left-hand panel, click Clusters
  • At the top, right-hand side, click Create Cluster

    A pop-up window opens to show the four-step process to create an HPC cluster (shown in the figure). You are at the first step: Security

    Create Cluster

  • Select your authentication method (LDAP or NIS) and then (in the Configuration section) click Select from existing

  • In the table of registered LDAP (or NIS) configurations, choose the directory service you registered previously
  • Click Next

    You are now at the second step: Cluster

    Create Cluster

  • Select the Basic tab

    The Advanced tab allows you to adjust the default values of parameters associated with the HPC cluster

  • Fill in the fields as follows.

    • Name: enter a name to identify the cluster
    • Machine Image: use the drop-down menu to select the cluster AMI for your region
    • App Images: use the drop-down menu to select mm-hpc and any other container images to use in the cluster
    • Instance Types: enter the EC2 instance types to use as cluster nodes, for example, t3.medium, t3.large
    • Time Zone: use the drop-down menu to select your time zone
    • Head Nodes: enter the instance type to use for the Head Node (keep the number of Head Nodes at one) and choose whether to use public or private IP addresses
    • Login Nodes: click Add and then enter the instance type to use for the Login Node(s), choose the number of Login Nodes, and whether to use public or private IP addresses
    • Compute Nodes: choose whether to use public or private IP addresses
    • Storage Volumes: click Select from existing and then choose the shared filesystem (NFS) you registered previously
    • Security Groups: click Add and then enter the security group(s) to use as the default security group(s) for cluster nodes. If not configured, the inbound rules from mvWorkerNodeSecurityGroup are used.
    • Subnets: click Add and then enter the subnet to use for cluster nodes. If not configured, the OpCenter chooses randomly.
    • QoS: set maximum limits for CPU cores and memory capacity for compute nodes in the cluster. Default: no limits.
  • Click Next

    You are now at the third step: Queues

    Create Cluster

  • Click Add Queue and fill in the fields as follows.

    • Name: enter a name for the queue, for example q1
    • Instance Types: delete or add instance types to use for compute nodes in this queue
    • Max Nodes: set maximum number of compute nodes that can be running simultaneously for this queue. Default: no limit.
    • Max Idle time: set the maximum time that a compute node can be idle before it is reclaimed. Default: ten minutes.
    • VM Policy: choose the policy for selecting compute nodes, for example, Spot First, and associated parameters
    • Storage Volumes: change the toggle switch to Queue, click Select from existing, and select the shared filesystem you registered earlier

      You can also configure local storage for each compute node by clicking Add and selecting Volume-New. Choose the size and enter the mount point, for example, /temp-dir.

    • Security Groups: change the toggle switch to Queue, click Add, and enter the security group(s) to apply to every compute node. If not configured, the inbound rules for mvWorkerNodeSecurityGroup are used.

    • Subnets: change the toggle switch to Queue, click Add, and then enter the subnet to use for cluster nodes. If not configured, the OpCenter chooses randomly.
    • QoS: set maximum limits for CPU cores and memory capacity for compute nodes in the cluster. Default: no limits.
    • Swap file size: Leave at 0. This means the OpCenter automatically calculates the swap size.
    • Hyperthreading: Keep the setting at Auto
  • Click Add

  • In the pop-up window, use the pull-down menu to select a default queue
  • Click Next

    You are now at the fourth step: Preview

  • Click Create

Modify an HPC Cluster

At any time after you start the HPC Cluster, you can "modify" the parameters or "modify and reconfigure." If you modify the parameters only, you must manually reconfigure the cluster for the changes to take effect.

To reconfigure the HPC cluster, complete the following steps.

  • Click Clusters in the left-hand panel
  • Identify your cluster and then click the Settings button under the Actions column (on the right-hand side)
  • In the pop-up window, click Reconfigure

Run "Hello World" job on the HPC Cluster

To test the HPC cluster, submit a "Hello World" job by completing the following steps.

  • ssh in to the Login Node by entering the following at a terminal prompt

    ssh USER@LOGIN_NODE_IP_ADDRESS
    

    Replace:

    • USER: a normal or admin user whose credentials are configured in the directory service (for example, LDAP)
    • LOGIN_NODE_IP_ADDRESS: the IP address of the Login Node
  • Enter the user's password at the prompt

  • Check that the shared filesystem is mounted by entering the following:

    $ df -h
    Filesystem                          Size  Used Avail Use% Mounted on
    devtmpfs                            3.8G     0  3.8G   0% /dev
    tmpfs                               3.8G   84K  3.8G   1% /dev/shm
    tmpfs                               3.8G  409M  3.4G  11% /run
    tmpfs                               3.8G     0  3.8G   0% /sys/fs/cgroup
    /dev/nvme0n1p3                       39G   18G   22G  45% /
    /dev/nvme0n1p2                     1014M  198M  817M  20% /boot
    172.31.45.42:/mnt/memverge           50G  3.0G   47G   6% /mnt/memverge
    tmpfs                               772M     0  772M   0% /run/user/997
    172.31.18.243:/nfs/exports/myshare  8.8G  1.7G  7.1G  19% /homehpc
    
  • Check that the current working directory is the shared filesystem (if not, enter cd /homehpc, for example)

    $ pwd
    /homehpc
    
  • Create a file called hellocluster.sh and enter the following content:

    #!/bin/bash
    #
    #SBATCH --job-name=hellocluster
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=1G
    #SBATCH --time=00:10:00
    #SBATCH --output=%x_%j.out
    #SBATCH --error=%x_%j.err
    
    touch /homehpc/hellocluster.txt
    echo "Hello World. This is the HPC cluster speaking." > /homehpc/hellocluster.txt
    echo "Hello Output"
    echo "Hello Error" 1>&2
    
  • Submit the job to the HPC cluster by entering:

    $ sbatch hellocluster.sh
    Submitted batch job 45
    
  • Confirm that the HPC cluster starts one compute node by viewing the Clusters screen on the web interface and clicking your HPC cluster

  • Check the status of the job by viewing the Jobs screen on the web interface and clicking the entry in the Host column associated with the hellocluster job (you can also check the job with Slurm commands from the login node, as shown at the end of this section)

  • Check standard output

    $ cat hellocluster_45.out
    Hello Output
    
  • Check standard error

    $ cat hellocluster_45.err
    Hello Error
    
  • Check that the output exists on the NFS server

     $ cat /nfs/exports/myshare/hellocluster.txt
     Hello World. This is the HPC cluster speaking.
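
As noted earlier, you can also follow the job from the login node with Slurm's own status commands. This is a sketch using the job ID from the example above; sacct output requires job accounting to be enabled on the cluster.

$ squeue -j 45                                               # state while the job is pending or running
$ sacct -j 45 --format=JobID,JobName,State,Elapsed,ExitCode  # accounting record after the job completes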