Managing an HPC cluster or cloud infrastructure: alternatives to xCAT

xCAT is the eXtreme Cloud Administration Toolkit from IBM.  It’s a suite of tools that IBM has developed to manage large groups of servers, such as a cloud infrastructure or a high-performance computing cluster (HPCC).  My only experience with xCAT is administering a mid-sized compute cluster (about 140 compute nodes totaling about 1400 cores, running RHEL 5), and at that scale I have not found it to be particularly effective.  In many ways, xCAT is a brilliant piece of software, but like many “brilliant” solutions, it’s just too complex for its own good.  There might be a cluster that is so large and complex that only a tool like xCAT can effectively manage it (especially if you have an administrative staff and you can pay someone to be a full-time xCAT guru).  If you have a smaller cluster with limited administrative resources, you’re better off finding a simpler management solution.

In contrast, I will briefly outline the administrative tools provided by Aspen Systems.  We are in the process of expanding our IBM x1350 cluster with about 1000 compute cores from Aspen. Aspen has developed their own suite of cluster administration tools that take a very different approach.  For a system such as ours (2400 cores total), the Aspen approach makes a lot more sense.  The Aspen tools have also been used to manage much larger clusters for customers such as NOAA, NREL and NIST, so I’m not sure if there is a cluster that is “too big” for Aspen’s tools.  I’ll list some of the key differences between the IBM and Aspen systems:

  1. xCAT deploys each node by using a Kickstart or AutoYaST script to control the installation of a Red Hat or SUSE OS.  “Prescripts” are used to set up partitions and do other pre-installation configuration, while “Postscripts” are used to finalize the node after the OS is installed.  If you have to do a custom configuration for a node, you have to learn to use Kickstart scripts and figure out how to customize xCAT, which is not well documented for anything other than “vanilla” IBM hardware.

    Aspen deploys nodes using a form of imaging.  You build one node and get it working just the way you like it.  The Aspen tools then take an “image” of the filesystem and deploy it on the other nodes.  I put “image” in quotes because it’s not an image in the sense of a bit-by-bit copy of a hard drive.  The “image” is a tarball of the entire filesystem that is unpacked after the new node is partitioned and formatted.  We found that the Aspen tools can also be used to provision IBM blade servers, so we’re migrating away from xCAT altogether.

  2. xCAT manages the cluster’s internal network using DHCP.  xCAT controls the DHCP server on the management/head/admin node by writing entries into /etc/dhcpd.conf and /var/lib/dhcpd/.

    Aspen tools manage the cluster with static IPs that are set in the file /etc/hosts.  There is a tool to automatically copy the hosts file from the head node to all of the other nodes.  This method seemed primitive to me at first, but it is much more straightforward, and so far I have not found any disadvantages.
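
To make the tarball-style “image” from item 1 concrete, here is a minimal Python sketch of the idea.  This is not Aspen’s actual tooling, which I haven’t inspected; the function names capture_image and deploy_image are made up for illustration.  A real deployment would also have to skip pseudo-filesystems such as /proc and /sys, preserve ownership, and install a boot loader, none of which is shown here.

```python
import tarfile
from pathlib import Path


def capture_image(root: Path, image: Path) -> None:
    """Pack everything under `root` into a gzip-compressed tarball.

    Hypothetical helper: stands in for "take an image of the golden node".
    """
    with tarfile.open(image, "w:gz") as tar:
        # arcname="." stores paths relative to `root`, so the archive
        # can later be unpacked under any target directory.
        tar.add(root, arcname=".")


def deploy_image(image: Path, target: Path) -> None:
    """Unpack the tarball into `target`.

    Assumes `target` is the mount point of a filesystem that has
    already been partitioned and formatted.
    """
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(image, "r:gz") as tar:
        tar.extractall(target)
```

The point of the sketch is that the “image” is just files, not disk blocks, so the target disk does not have to match the size or layout of the node the image was built on.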
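
The static-IP scheme from item 2 is simple enough to sketch in a few lines of Python.  Again, this is only an illustration under my own assumptions: the function name render_hosts, the “node001”-style naming, and the 10.0.0.x addresses are hypothetical, not what the Aspen tools actually produce.

```python
import ipaddress


def render_hosts(prefix: str, first_ip: str, count: int) -> str:
    """Return /etc/hosts content for `count` nodes with consecutive IPs.

    Hypothetical helper: node names are `prefix` plus a three-digit index.
    """
    start = ipaddress.IPv4Address(first_ip)
    lines = ["127.0.0.1\tlocalhost"]
    for i in range(count):
        name = f"{prefix}{i + 1:03d}"
        # IPv4Address supports integer addition, giving the next address.
        lines.append(f"{start + i}\t{name}")
    return "\n".join(lines) + "\n"
```

Calling render_hosts("node", "10.0.0.1", 3) yields a localhost line followed by node001 through node003 at 10.0.0.1 through 10.0.0.3.  The resulting file would then be copied from the head node to every other node, which is the part the Aspen tool automates.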

I’ll update this comparison once I get more familiar with the Aspen tools.
