High Performance Computing for Easy Deployments
Introduction
In December 2024, just before the holidays, we announced the general availability of CloudFormation support in Parallel Computing Service (PCS)—our newest high performance computing (HPC) managed service. This service makes it easy to deploy and scale your HPC workloads, as well as scientific and engineering models, in the cloud.
We believe this is a significant milestone. However, if you haven’t worked with Infrastructure as Code (IaC) before, you may wonder why this is important. IaC allows you to define your infrastructure using readable, editable, and version-controlled code. The cloud then takes care of provisioning resources—often in minutes—with minimal manual effort.
Today, we will highlight some powerful examples of IaC from our HPC Recipes Library. Some of these recipes can spin up a fully configured HPC cluster in less time than it takes for your pizza to arrive! You can enjoy your pizza while exploring these recipes and deciding how to adapt them to your workflow. Since our HPC Recipes Library is open source, you can customize and optimize them as needed to get your cluster up and running efficiently.
What Has CloudFormation Ever Done for High Performance Computing?
Before we dive into the recipes, it’s essential to understand the benefits of IaC when deploying HPC environments.
With CloudFormation support in Parallel Computing Service, creating a high performance computing cluster is now simpler than ever. Instead of configuring everything manually, you can deploy entire HPC environments with just a few commands or clicks using predefined stack sets or recipes.
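If you’d rather script than click, a deployment is only a few lines. Here’s a minimal sketch using Python and boto3 – the stack name is arbitrary, and the template URL is a placeholder you’d swap for the asset URL in a recipe’s README:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-2")

# Launch a recipe's CloudFormation template. The URL below is a placeholder;
# use the real "Launch stack" asset URL from the HPC Recipes Library.
cfn.create_stack(
    StackName="pcs-getting-started",
    TemplateURL="https://example-bucket.s3.amazonaws.com/recipes/pcs/cluster.yaml",
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)

# Block until CloudFormation reports the whole environment is up.
cfn.get_waiter("stack_create_complete").wait(StackName="pcs-getting-started")
print("Cluster stack deployed")
```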
Key Benefits of Using IaC for HPC
Simplicity and Automation
With CloudFormation, HPC cluster deployment becomes as seamless as deploying an application. Users don’t need to understand every intricate configuration step—just run the deployment script, and the infrastructure is provisioned.
Consistency and Reproducibility
Every cluster deployment produces identical results, eliminating inconsistencies between development, testing, and production environments. This also ensures seamless disaster recovery and simplifies multi-region scaling.
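To make that concrete, here’s a sketch that stamps out identical environments in several regions from a single template (the URL is again a placeholder):

```python
import boto3

TEMPLATE_URL = "https://example-bucket.s3.amazonaws.com/recipes/pcs/cluster.yaml"  # placeholder

# One template, three regions: each stack is provisioned from the same code,
# so the resulting environments can't drift apart at creation time.
for region in ["us-east-1", "us-east-2", "eu-west-1"]:
    boto3.client("cloudformation", region_name=region).create_stack(
        StackName="hpc-cluster",
        TemplateURL=TEMPLATE_URL,
        Capabilities=["CAPABILITY_IAM"],
    )
```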
Security and Compliance
With IaC, security best practices can be codified and deployed consistently. Instead of relying on manual steps to secure a cluster, security configurations can be predefined, reducing human error. Additionally, IaC enables comprehensive auditing and automated compliance reporting.
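As a small illustration of codified security, the sketch below builds a template fragment in plain Python – a security group that only accepts SSH from one corporate CIDR – and asks CloudFormation to validate it. The names and CIDR are illustrative, not taken from any recipe:

```python
import json
import boto3

# Security rules written once as code, then deployed identically everywhere.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {"VpcId": {"Type": "AWS::EC2::VPC::Id"}},
    "Resources": {
        "LoginNodeSG": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "SSH to the login node from the corporate network only",
                "VpcId": {"Ref": "VpcId"},
                "SecurityGroupIngress": [
                    # Illustrative corporate CIDR; no 0.0.0.0/0 in sight.
                    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
                     "CidrIp": "203.0.113.0/24"}
                ],
            },
        }
    },
}

boto3.client("cloudformation").validate_template(TemplateBody=json.dumps(template))
```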
Now, let’s explore three HPC recipes from our HPC Recipes Library, each demonstrating different aspects of high performance computing with CloudFormation.
Three HPC Recipes from Our Library
1. Getting Started
The Getting Started recipe replicates our popular tutorial experience – without the assembly steps. This recipe gives you a basic but fully functional cluster that’s perfect for learning and testing. It’s an ideal starting point if you’re new to PCS or want to validate your environment.

Every cluster spun up with these recipes creates a new Virtual Private Cloud (VPC), which is the environment in which we build the clusters. You don’t need a new VPC just to use PCS – we’ve just included a new, clean, juiced-up one in these recipes mainly so we don’t bump into unnecessary limits when we’re spinning up the cluster. It’s also a good idea to run these recipes initially as a user with Administrator privileges, at least until you finish experimenting and turn to tailoring the setup to your environment and its particular security (or compliance) needs.
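Once the stack finishes, its outputs tell you what got built. A quick sketch – the output keys vary by recipe, so check the template’s Outputs section:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-2")

# Print every output the recipe's stack exposes (cluster ID, filesystem IDs, ...).
stack = cfn.describe_stacks(StackName="pcs-getting-started")["Stacks"][0]
for output in stack.get("Outputs", []):
    print(f'{output["OutputKey"]}: {output["OutputValue"]}')
```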
Scheduler – Every cluster launched comes with its own managed Slurm controller, running in a cloud service account, which means we’re managing it and – of course – looking after updates and upgrades for you when they come along.
Compute nodes – We pre-configure every Getting Started cluster with two compute node groups (CNGs). The first spins up a single login node so you have something to connect to for compiling code or submitting jobs. The second CNG is more elastic: it contains from zero to four compute nodes. The number of nodes depends on the jobs you submit to Slurm, but defaults to zero – ideal for a cloud HPC environment, which should consume as little as possible when there are no jobs to run.
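You can watch that elasticity from the login node: submit a small job and nodes appear, let the queue drain and they disappear. A sketch, with an illustrative job script:

```python
import subprocess
import textwrap

# A trivial two-node job: submitting it prompts PCS to scale the elastic
# compute node group up from zero.
job = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=hello
    #SBATCH --nodes=2
    srun hostname
""")

with open("hello.sbatch", "w") as f:
    f.write(job)

subprocess.run(["sbatch", "hello.sbatch"], check=True)  # queue the job
subprocess.run(["squeue"], check=True)                  # watch nodes spin up
```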
This recipe uses very small instances (c6i.xlarge) each with 4 vCPUs – making this a 16 vCPU cluster. Its main purpose is to help you learn about the knobs and dials in PCS, so we aimed for small, low-cost instances. You can easily change this, though, by just editing the compute node group definition in the PCS console. You can also clone a compute node group in the PCS console, changing the instance types it draws from along the way.
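Those edits are scriptable too. Assuming the boto3 pcs client and the parameter names shown here (the cluster and node group identifiers are illustrative – look yours up with list_clusters and list_compute_node_groups), raising the elastic group’s ceiling might look like this:

```python
import boto3

pcs = boto3.client("pcs", region_name="us-east-2")

# Raise the elastic node group's ceiling from 4 to 8 nodes while keeping
# scale-to-zero. Identifiers are illustrative placeholders.
pcs.update_compute_node_group(
    clusterIdentifier="pcs-getting-started",
    computeNodeGroupIdentifier="compute-1",
    scalingConfiguration={"minInstanceCount": 0, "maxInstanceCount": 8},
)
```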
Networking – where possible, we use instances that support Elastic Fabric Adapter (EFA), our custom-built network interface for HPC and ML applications. EFA isn’t usually offered on smaller instance sizes, so the 4-core instances used by some recipes won’t give you the blazing-fast performance you might be looking for. But then, “4-core instances” should have told you that performance wasn’t our motivation when choosing them for this demonstration. Larger instances do support EFA, and where they do, we use it – and we provision instances in Cluster Placement Groups so your compute resources are physically close to each other, to minimize latency.
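If you’re unsure whether an instance type offers EFA, you can ask EC2 directly rather than hunting through documentation. A short sketch:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# EfaSupported in NetworkInfo tells you whether a type can attach an EFA.
resp = ec2.describe_instance_types(InstanceTypes=["c6i.xlarge", "hpc7a.48xlarge"])
for it in resp["InstanceTypes"]:
    efa = it["NetworkInfo"].get("EfaSupported", False)
    print(f'{it["InstanceType"]}: EFA supported = {efa}')
```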

Storage – The recipe will also create two filesystems for your data. User data (/home) lives on an Amazon Elastic File System (Amazon EFS) NFS v4 filesystem, which has great performance for this purpose. Scratch data can use Amazon FSx for Lustre (the ‘persistent’ deployment type, mounted as /fsx). If the 1.2 TB sample Lustre filesystem isn’t enough for you, it’s easy to expand in the Amazon FSx console. You can also tweak the throughput and metadata IOPS while you’re there. Yes, that’s pretty cool.
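Expanding the scratch filesystem is scriptable as well. A sketch with a placeholder filesystem ID (Lustre capacity grows in fixed increments, so pick a valid target size):

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-2")

# Grow the Lustre scratch filesystem from 1.2 TB to 2.4 TB (capacity is in GiB).
# The filesystem ID is a placeholder; find yours with describe_file_systems.
fsx.update_file_system(
    FileSystemId="fs-0123456789abcdef0",
    StorageCapacity=2400,
)
```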
SSH access – when the cluster comes up, you can choose to download the SSH key pair the recipe created for you, or you can connect to the login node through the console’s browser-based terminal (AWS Systems Manager Session Manager).
This HPC cluster is designed for learning, experimentation, and cost-efficiency while providing the core components needed for real-world high performance computing workloads.
2. AMD-Powered HPC: Industry-Standard x86 Processors
The Try AMD recipe deploys a complete HPC environment built on 4th Gen AMD EPYC (Genoa) processors – x86 being the current mainstream standard in high performance computing.
It has all the same architectural features of the Getting Started cluster, but differs in a few key aspects.
Like it says on the lid: this cluster is AMD-based, so the login node and all the compute nodes use AMD CPUs.
Next, it’s limited to using the us-east-2 region in Ohio, since this is one of the regions that provides access to the Hpc7a instance family.
This cluster has two Slurm partitions: small and large. The small partition sends jobs to nodes supplied by a c7a.xlarge node group. These are quite small (and very low-cost) instances, but without Elastic Fabric Adapter (EFA) networking – great for kicking the tires while you get used to how PCS works. The large partition sends work to a node group that uses hpc7a.48xlarge instances. These have EFA built in (dual-rail, in fact), delivering 300 Gbit/s, and come with 96 cores and 768 GB of RAM.
And as we discussed earlier, we provision instances in tight Cluster Placement Groups, so they’re physically close to each other.
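Steering work between the two partitions is just Slurm’s -p flag. A sketch you’d run from the login node (the job scripts are illustrative):

```python
import subprocess

# Tire-kicking jobs go to the low-cost c7a nodes...
subprocess.run(["sbatch", "-p", "small", "hello.sbatch"], check=True)

# ...and real MPI work goes to the EFA-equipped hpc7a nodes.
subprocess.run(["sbatch", "-p", "large", "--nodes=2", "mpi_job.sbatch"], check=True)
```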
The last thing to say is that Hpc7a instances require access-listing to use, so if your account isn’t already enabled for this EC2 family, you’ll probably need to cut a ticket to support or contact your account manager or solutions architect. If you submit jobs to the large queue and don’t see any instances spinning up, chances are you’re not access-listed, or you may be exceeding a quota limit. Either way, the support teams we just mentioned will help you figure it out.
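Before cutting that ticket, it’s worth seeing where your quotas stand. This sketch lists the EC2 quotas whose names mention HPC (the loose name match is just for illustration):

```python
import boto3

sq = boto3.client("service-quotas", region_name="us-east-2")

# Walk all EC2 quotas and print the ones relating to HPC instance families.
for page in sq.get_paginator("list_service_quotas").paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "HPC" in quota["QuotaName"].upper():
            print(f'{quota["QuotaName"]}: {quota["Value"]}')
```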
3. Computing with Arm64 on Graviton
Arm-based processors are making big inroads into every corner of computing. In fact, in December at re:Invent, it was revealed that Graviton system deployments made up more than half of the growth of Amazon EC2 over the last two years.
The Try Graviton recipe gets you an Arm-based cluster with full-sized, serious processors that offer an impressive combination of performance, cost, and energy reduction. Graviton-based instances deliver up to 40% better price performance than comparable x86-based instances for many workloads. They also use up to 60% less energy, which helps thousands of customers with their carbon-reduction strategies.
Like the other two recipes we’ve just described, the cluster comes with all the same architectural features for storage, login nodes (a c7g.2xlarge in this case), and SSH access.
The compute node group supporting the Slurm queue uses hpc7g.16xlarge instances – these are built from 64-core Graviton3E processors and have 128 GB of RAM.
It also comes with Open MPI 5, which supports the Elastic Fabric Adapter (EFA), our custom-built network interface for HPC and ML applications. It’s the same fabric – and the same adapter – that we use across our Amazon EC2 fleet; it’s no different for Graviton.
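On a compute node you can confirm the EFA provider is visible to libfabric before launching MPI work – a one-line check, wrapped in Python here for consistency:

```python
import subprocess

# Lists libfabric's EFA provider endpoints; a non-zero exit means no EFA visible.
subprocess.run(["fi_info", "-p", "efa"], check=True)
```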
For users looking to experiment with Arm-based high performance computing, this recipe provides an excellent starting point.
Cleaning Up: Avoid Unnecessary Costs
Pay some attention to the clean-up details in each recipe’s README.md file. This will ensure you don’t leave something behind that’s costing you money after you’ve finished experimenting.
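Most of the teardown is a single call: deleting the stack removes what the recipe created. A sketch – though do still read the README, since resources holding data (like filesystems) can need separate attention:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-2")

# Tear down the recipe's stack and wait until everything is gone.
cfn.delete_stack(StackName="pcs-getting-started")
cfn.get_waiter("stack_delete_complete").wait(StackName="pcs-getting-started")
print("Cluster stack deleted")
```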
Start in Minutes: Deploy Your HPC Cluster Today
Are you ready to experience high performance computing with CloudFormation? We’ve made the process as straightforward as possible. Visit the HPC Recipes Library on GitHub, choose a recipe, and deploy your HPC cluster in just 20 minutes—faster than most pizza deliveries!
Each recipe includes step-by-step deployment instructions, and they’re designed to be adopted, forked, and modified to quickly suit your needs and the demands of your users.
The best part? You can experiment with different configurations and architectures without the usual learning curve and setup time. Just choose your recipe, deploy with CloudFormation, and start computing.
Let us know how you get on.
Conclusion
Cloudastra helps organizations harness the power of high performance computing by providing cloud services that simplify HPC cluster deployment and management. By adopting Infrastructure as Code (IaC) with CloudFormation, businesses can achieve:
Automated deployments, reducing manual intervention and setup time.
Consistent, repeatable environments, ensuring stability across different stages.
Enhanced security and compliance, enabling easy tracking of infrastructure changes.
Additionally, automating AWS deployments with CloudFormation streamlines operations, allowing teams to focus on scientific computing, engineering simulations, and data-intensive applications.
Would you like to read more educational content? Read our blogs at Cloudastra Technologies or contact us for business enquiries at Cloudastra Contact Us.