Category Archives: Upgrades

New SMP+GPU system: Executor

** Forward this to any group members who may be interested in new HPC systems **

Executor:

A new large-memory SMP+GPU system named executor.structbio.pitt.edu has been released. This type of system design is useful when many cores (more than fit on a single cluster node) need access to a large pool of shared memory.

Executor is similar to Archer in usage scenarios, but it has more (and faster) cores, more RAM, and more high-performance local scratch space. In addition, it has two of the latest-generation GPUs for parallel co-processing.

The specs:

64 AMD “Epyc” CPU Cores running at 3.0GHz (with boost)
512GB RAM
20TB scratch volume (located at /executor)
2 × RTX 2080 Ti GPUs (11 GB GDDR6 RAM per GPU)
8704 total CUDA compute cores

Executor uses the SLURM job scheduler for job submission and environment modules for loading software environments. Log in via SSH with your Structbio ID, just as on our other HPC systems. SLURM syntax differs slightly from PBS, but converting scripts is not difficult. Read more about that here.
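As a rough illustration of the conversion, here is a minimal SLURM batch script sketch. The module name, core count, scratch path, and application line are placeholders, not prescriptions; run module avail on Executor to see what is actually installed.

#!/bin/bash
#SBATCH --job-name=refine_test      # equivalent of "#PBS -N"
#SBATCH --nodes=1                   # equivalent of "-l nodes=1"
#SBATCH --ntasks-per-node=16        # equivalent of "ppn=16"
#SBATCH --gres=gpu:2                # request both RTX 2080 Ti GPUs (assumes a GPU gres is configured)
#SBATCH --time=24:00:00             # equivalent of "-l walltime=24:00:00"

module load relion/3.0              # module name/version is an example; see "module avail"
cd /executor/$USER/myproject        # work from the fast scratch volume (path is illustrative)

srun relion_refine_mpi ...          # placeholder for the real workload

Submit with "sbatch myjob.slurm" instead of "qsub myjob.pbs".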

So far, this system has been tested with cryo-EM software such as Relion 3, MotionCor, EMAN2, CTFFIND, and Auto3dem. If you would like additional software or modules installed, send me an email request.
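For reference, the basic environment-modules workflow looks like this; the module name and version shown are only examples of what the listing might contain.

module avail              # list the software installed on Executor
module load relion/3.0    # load a specific package/version (example name)
module list               # show what is currently loaded
module purge              # clear all loaded modules between runs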

Happy Computing,

Doug

Ultron: new GPU nodes released

Testing and performance tuning on these new GPU nodes has gone faster than expected, so I’m releasing them into the wild for general use. This adds 440 teraflops of GPU processing capability to our cluster.

The “feature” parameter is required in your PBS script to specify a particular type of GPU. For example, if you wanted to use two of the Titan V (V100) GPUs on one node:

qsub -l nodes=1:ppn=20:gpus=2,feature=V100

The four Tesla K80 GPU cores remain in the system, and can be specified like this:

qsub -l nodes=1:ppn=28:gpus=4,feature=K80
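The same request can also be made from inside a batch script. A minimal sketch follows; the job name, walltime, module, and application line are placeholders.

#!/bin/bash
#PBS -N v100_job                           # job name (placeholder)
#PBS -l nodes=1:ppn=20:gpus=2,feature=V100
#PBS -l walltime=24:00:00                  # adjust to your expected run time
#PBS -j oe                                 # merge stdout and stderr into one log

cd $PBS_O_WORKDIR                          # PBS starts jobs in $HOME; return to the submit directory
module load cuda                           # example module; load whatever your application needs

nvidia-smi                                 # confirm the requested GPUs are visible to the job
# ./your_gpu_application ...               # placeholder for the real workload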

For more details you can see this post or visit the HPC section of the website.

Happy Computing.

Ultron: New GPU Nodes In Beta Testing

Two new nodes have been added to our HPC cluster, Ultron.

These nodes each contain two Titan V GPUs, built on Nvidia’s newest Volta (GV100) silicon, the same core used in the Tesla V100. Each GPU has 5120 CUDA cores and 12GB of HBM2 memory. HBM2 is significant because of its incredibly fast transfer speed to the compute cores: 1.7Gbps per pin across a very wide memory bus, or roughly 650GB/s of total bandwidth.

This all equates to 110 teraflops of performance per GPU, for a total added compute of 440 teraflops (not including the new CPU cores)! While not all applications support GPU processing, those that do benefit greatly.

The servers, node11 and node12, have dual 10-core Intel 2.4GHz E5-2640 v4 CPUs.  Each has 128GB RAM and a local SSD for scratch storage (in /local) if needed.
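If your job is I/O-heavy, a common pattern is to stage input onto the node-local SSD and copy results back when the job finishes. A minimal sketch is below; the directory layout under /local is an assumption for illustration, not a site policy.

# inside a PBS job script
SCRATCH=/local/$USER/$PBS_JOBID            # per-job scratch directory (layout is illustrative)
mkdir -p $SCRATCH
cp -r $PBS_O_WORKDIR/input $SCRATCH/       # stage input onto the SSD
cd $SCRATCH
# ... run your application against the local copy ...
cp -r $SCRATCH/output $PBS_O_WORKDIR/      # copy results back to network storage
rm -rf $SCRATCH                            # clean up the SSD for the next job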

If you wish to test the new nodes, you can submit a job.  I added the “feature” parameter to the queuing system to differentiate between the old GPUs (K80) and the new GPUs (V100).  Note that the nodes are still being performance tuned, so there may be unexpected interruptions during the beta phase.

For example, if you wanted to use two V100 GPUs on one node:

qsub -l nodes=1:ppn=20:gpus=2,feature=V100
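For quick hands-on testing, an interactive session is often the easiest way to confirm the GPUs are allocated as expected. A sketch, assuming interactive jobs are permitted on the queue:

qsub -I -l nodes=1:ppn=20:gpus=2,feature=V100,walltime=01:00:00
# once the interactive shell opens on node11 or node12:
nvidia-smi        # should list two Titan V devices with 12GB of HBM2 each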

If you do submit a job, please email me with any issues and performance feedback. If you would like me to monitor your job on the backend, send me an email before submitting it.

Here is the total spec sheet for Ultron in its current state:

Ultron Specifications

Head node with 24TB ultra-fast SSD storage for user homes
12 total compute nodes
10 Compute nodes each have dual 14-core Intel 2.4GHz E5-2680 v4 CPUs
2 Compute nodes each have dual 10-core Intel 2.4GHz E5-2640 v4 CPUs
10 Compute nodes each have 256GB RAM
2 Compute nodes each have 128GB RAM
Compute nodes each have a 512GB SSD for local scratch space (/local)
Two Nvidia Tesla K80s (two GPU cores each) are available for GP-GPU / CUDA calculations (node10)
Four Nvidia Titan V (V100) GPUs are available for GP-GPU / CUDA calculations (node11 and node12)
Cluster communication is via 56Gb/s FDR Infiniband

In total, the compute nodes provide 320 Xeon CPU cores (10 × 28 + 2 × 20) and 2.82TB of RAM
In total, the GPUs provide 30,464 CUDA cores (9,984 in the K80s + 20,480 in the Titan Vs) and 96GB of GPU memory (GDDR5 on the K80s, HBM2 on the Titan Vs)

Firmware upgrade

A non-service-affecting firmware upgrade will take place today on the devices serving network home directories. This is necessary to facilitate a security patch to the operating system hosting these systems.

All home directories are replicated nightly to disk and backed up nightly to tape. The former provides quick recovery if there is a catastrophic issue with one of the arrays; the latter provides protection against crypto-ware and an archival catalog of files going back roughly three months.