New SMP+GPU system: Executor

** Forward this to any group members that may be interested in new HPC systems **

Executor:

A new large-memory-footprint SMP+GPU system has been released, named executor.structbio.pitt.edu. This type of system design is useful when many cores (more than would fit on a single cluster node) need access to a large bank of shared memory.

Executor is similar to Archer in usage scenarios, but has more (and faster) cores, more RAM, and more high-performance local scratch space. In addition, it has two of the latest-generation GPUs for parallel co-processing.

The specs:

64 AMD EPYC CPU cores running at 3.0GHz (with boost)
512GB RAM
20TB scratch volume (located at /executor)
2 RTX 2080 Ti GPUs (11GB GDDR6 RAM per GPU)
8704 total CUDA compute cores

Executor uses the SLURM job scheduler for job submission and environment modules for loading software environments. Log in via SSH with your Structbio ID, just as on our other HPC systems. SLURM's syntax differs slightly from PBS, but converting scripts is not difficult. Read more about that here.
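
For anyone converting an existing PBS script, here is a rough sketch of what a minimal SLURM batch script might look like (the resource amounts, module name, and command below are placeholders, not a tested Executor configuration):

#!/bin/bash
#SBATCH --job-name=test_job          # job name (PBS: -N)
#SBATCH --nodes=1                    # number of nodes (PBS: nodes=1)
#SBATCH --ntasks-per-node=16         # tasks per node (PBS: ppn=16)
#SBATCH --gres=gpu:1                 # request one GPU (PBS: gpus=1)
#SBATCH --time=24:00:00              # walltime (PBS: -l walltime=24:00:00)

module load relion                   # example module; check "module avail" for real names
srun my_command                      # placeholder; replace with your actual program

Submit it with "sbatch myjob.sh" and check on it with "squeue -u $USER".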

So far, this system has been tested with CryoEM software such as Relion 3, MotionCor, EMAN2, CTFFIND, and Auto3dem. If you would like additional software or modules installed, send me an email request.
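
As a quick reminder on environment modules, you can see what is installed and pull a package into your session like this (the exact module names and versions will vary, so treat "relion" below as an example only):

module avail          # list all installed modules on Executor
module load relion    # load one into your environment (exact name/version may differ)
module list           # confirm what is currently loaded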

Happy Computing,

Doug

Ultron: new GPU nodes released

Testing and performance tuning on these new GPU nodes has gone faster than expected, so I’m releasing them into the wild for general use. This adds 440 teraflops of GPU processing capability to our cluster.

The “feature” parameter is required in your PBS script to specify a particular type of GPU. For example, if you wanted to use 2 of the Titan V (V100) GPUs on 1 node:

qsub -l nodes=1,ppn=20,gpus=2,feature=V100

The 4 Tesla K80 GPU cores remain in the system and can be requested like this:

qsub -l nodes=1,ppn=28,gpus=4,feature=K80
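
If you want to verify which GPUs your job actually landed on, nvidia-smi (a standard NVIDIA utility, nothing specific to our queuing setup) run from inside the job script or an interactive session on the node will show them:

nvidia-smi                                              # full status table
nvidia-smi --query-gpu=name,memory.total --format=csv   # just the GPU model and memory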

For more details you can see this post or visit the HPC section of the website.

Happy Computing.

Ultron: New GPU Nodes In Beta Testing

Two new nodes have been added to our HPC cluster, Ultron.

These nodes each contain two Titan V GPUs, built on Nvidia's newest V100 (Volta) core. Each V100 core has 5120 CUDA cores and 12GB of HBM2 memory. HBM2 is significant because of its incredibly fast transfer speed to the compute cores: a 1.7Gbps per-pin data rate, which works out to roughly 650GB/s of total memory bandwidth.

This all equates to up to 110 teraflops of deep-learning (tensor core) performance per GPU, for a total added compute of 440 teraflops (not including the new CPU cores)! While not all applications support GPU processing, those that do benefit greatly from it.

The servers, node11 and node12, have dual 10-core Intel 2.4GHz E5-2640 v4 CPUs.  Each has 128GB RAM and a local SSD for scratch storage (in /local) if needed.

If you wish to test the new nodes, you can submit a job.  I added the “feature” parameter to the queuing system to differentiate between the old GPUs (K80) and the new GPUs (V100).  Note that the nodes are still being performance tuned, so there may be unexpected interruptions during the beta phase.

For example, if you wanted to use 2 V100 GPUs on 1 node:

qsub -l nodes=1,ppn=20,gpus=2,feature=V100
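
If you would rather submit from a batch script than the command line, a bare-bones sketch would look something like this (the walltime, module, and program names are placeholders; the resource line is the same one shown above):

#!/bin/bash
#PBS -N v100_test                           # job name
#PBS -l nodes=1,ppn=20,gpus=2,feature=V100  # same resource request as above
#PBS -l walltime=24:00:00                   # adjust to your run

cd $PBS_O_WORKDIR        # start in the directory the job was submitted from
module load cuda         # example module; load whatever your application needs
./my_gpu_program         # placeholder; replace with your actual command

Then submit it with "qsub myscript.pbs".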

If you do submit a job, please email me with any issues and performance feedback. If you would like me to monitor your job on the backend, send me an email before submitting it.

Here is the total spec sheet for Ultron in its current state:

Ultron Specifications

Head node with 24TB ultra-fast SSD storage for user homes
12 total compute nodes
10 compute nodes each have dual 14-core Intel 2.4GHz E5-2680 v4 CPUs
2 compute nodes each have dual 10-core Intel 2.4GHz E5-2640 v4 CPUs
10 compute nodes each have 256GB RAM
2 compute nodes each have 128GB RAM
Compute nodes each have a 512GB SSD for local scratch space (/local)
Two Nvidia Tesla K80s (two GPU cores each) are available for GP-GPU / CUDA calculations (node10)
Four Nvidia Titan V (V100) GPUs are available for GP-GPU / CUDA calculations (node11 and node12)
Cluster communication is via 56Gb/s FDR Infiniband

In total, there are 320 Xeon CPU cores and 2.82TB of CPU RAM
In total, there are 30,464 CUDA cores and 96GB of GPU memory (GDDR5 on the K80s, HBM2 on the Titan Vs)

Webserver / Intranet Planning

Although our public sites are currently hosted at central Computing Services as required by Pitt CSSD, we have a number of intranet sites that are on an internal-use webserver.  One of these is the critical scheduler.

Currently, those sites run on a very old server system, an Intel Xserve from Apple. These machines haven't been produced for quite some time.

I’m testing those websites in a Linux environment to make sure each one will work correctly. The platform change itself should have minimal impact; the more significant change is in versions of the databases, PHP, and Apache.
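
For anyone curious, the comparison mostly amounts to recording the stack versions on the old and new hosts; the generic commands are along these lines (exact command names vary a bit by distribution):

php -v              # PHP version
httpd -v            # Apache version (apache2 -v on Debian-based systems)
mysql --version     # database client version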

Once I have the kinks worked out in a virtual machine, these intranet sites will be migrated to a Linux server. This will provide new hardware and additional redundancy, in addition to the newer OS and LAMP stack versions.

Backup system upgrade completed

Backup system upgrades are complete! I have migrated our backup system from one capable of 342TB of cold storage to one capable of 1.4PB, and it can be expanded further in the future simply by adding tape slots.

For the new system, I considered a number of options, including disk-based and cloud-based backups. With the recent influx of malware and cryptoware, it was determined that backed-up data is much safer on a cold medium such as tape than on a hot medium like disk. Cloud-based backup is economical from a backup point of view, but should the need to restore ever arise, the cost of downloading and restoring an entire backup server is astronomical, especially when weighed against the roughly 6-year lifespan of a typical tape backup robot.

The new solution provides a server with 8TB of staging space on faster disks. This is important for a number of reasons: the new LTO-7 drives write data much faster and hold 6.0TB per tape, uncompressed. We keep the same slot configuration of 228 robotically automated slots.

We plan to add more LTO-7 drives in the future for increased concurrency.  The disk/spool storage is also expandable in this design, should the need arise to have larger spooling capacity.

Now that this system is in place, I will begin to add some new and larger storage systems. These will be particularly beneficial for cryoEM and X-ray data, as those datasets continue to grow along with resolution. New storage arrays have already shipped from the vendor and will be integrated over the coming weeks.

Happy Computing.

Firmware upgrade

A non-service-affecting firmware upgrade will take place today on the devices serving network home directories. This is necessary to apply a security patch to the operating system hosting these systems.

All home directories are replicated nightly to disk and backed up nightly to tape. The former provides quick recovery if there is a catastrophic issue with one of the arrays; the latter provides protection against crypto-ware and an archival catalog of files going back roughly three months.

 

Preparing for backup upgrades

Soon, I will be upgrading our backup system to meet the needs of current and future storage expansion projects. I considered a number of options, including disk-based and cloud-based backups. With the recent influx of malware and cryptoware, it was determined that backed-up data is much safer on a cold medium such as tape than on a hot medium like disk. Cloud-based backup is economical from a backup point of view, but should the need to restore ever arise, the cost of downloading and restoring an entire backup server is astronomical, especially when weighed against the roughly 6-year lifespan of a typical tape backup robot.

Our current (soon-to-be-previous) backup server has 4TB of staging space (to prevent tape scrubbing). We have 228 robotically automated tape slots using LTO-5 media, which holds 1.5TB per tape, and 3 LTO-5 drives. Current uncompressed backup capacity is 342TB (228 slots x 1.5TB).

The new solution will provide a server with 8TB of staging space on faster disks. This is important for a number of reasons: the new LTO-7 drives write data much faster and hold 6.0TB per tape, uncompressed. We will keep the same slot configuration of 228 robotically automated slots. The system will start with 2 LTO-7 drives for cost reasons, and we will add two more in the next fiscal year. In addition to the performance and reliability advantages, the system will give us an uncompressed capacity of 1.4PB (228 slots x 6.0TB).

Once this is in place, I will begin to add some new and larger storage systems. These will be particularly beneficial for cryoEM and X-ray data, as those datasets continue to grow along with resolution.

Happy Computing.

Ultron storage expanded

I recently completed upgrades to the Ultron storage space, doubling it from 11TB to 22TB. This is all-flash storage in RAID10 and offers extreme throughput for HPC jobs.

Note that this space is still intended only for processing. Once data is processed, it should be moved to one of the file servers, since that space is much less expensive. If you need any assistance with this, email the systems administrator.
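
If you would like to move data yourself, rsync works well for this; the paths below are just placeholders for your own processing directory and file server share:

rsync -avh --progress /path/to/my_processing_dir/ /path/to/fileserver_share/my_project/
# verify the copy completed cleanly before removing anything from the processing space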