System Grete in Göttingen
We’re happy to announce the beginning of regular user operation of our new GPU cluster “Grete” in Göttingen.
The main part of the cluster is available via the new partition grete, consisting of 33 nodes equipped with 4 NVIDIA Tesla A100 40 GB GPUs, 2 AMD Epyc CPUs, and an Infiniband HDR interconnect. The grete:shared partition additionally contains two nodes with 8 A100 80 GB GPUs each. All nodes provide 16 CPU cores and 128 GB of memory per GPU. “Grete” has a dedicated new login node, glogin9, which is also available via its DNS alias glogin-gpu.hlrn.de.
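To illustrate, a minimal batch script for the grete partition might look like the following sketch (the project account and executable are placeholders; -G 4 simply requests four GPUs for the job):

    #!/bin/bash
    #SBATCH --partition=grete        # full A100 nodes
    #SBATCH --nodes=1
    #SBATCH -G 4                     # request all 4 GPUs of the node
    #SBATCH --time=12:00:00          # within the 2-day walltime limit
    #SBATCH --account=myproject      # placeholder: your compute project account

    srun ./my_gpu_program            # placeholder executable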
Another 3 GPU nodes are available in the partition grete:interactive for interactive usage (limited to 2 jobs per user). The grete:preemptible partition is available for backfilling these nodes. On these nodes, the GPUs are split via Multi-Instance GPU (MIG) into slices with 2 or 3 compute units and 10 or 20 GB of GPU memory, respectively. These slices can be requested like GPUs in Slurm; for example, -G 2g.10gb:1 allocates one slice with 2 compute units and 10 GB of GPU memory. Preemptible jobs do not cost core hours, but a compute project account has to be used, just like for the preempt QoS in the CPU partitions.
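For instance, an interactive session on such a slice could be requested roughly like this (the project account is a placeholder):

    # request one MIG slice with 2 compute units and 10 GB of GPU memory
    srun --partition=grete:interactive -G 2g.10gb:1 --account=myproject --pty bash
    # inside the session, list the assigned GPU/MIG device
    nvidia-smi -L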
The default walltime limit on all grete partitions is 2 days.
Part of “Grete” is a new, dedicated flash-based WORK storage system mounted at /scratch on the new GPU nodes and glogin9. Each user and each compute project has a soft (hard) block quota of 3 TB (6 TB) and 1M (2M) inodes. The system is intended for fast access to the active data set required by the currently running jobs. The existing “Emmy” WORK file system is still reachable from the new cluster under /scratch-emmy via a long-distance connection. The HOME and PERM file systems are shared between “Emmy” and “Grete”.
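If the active data set still resides on the “Emmy” WORK file system, it can be staged onto the new flash storage before or at the start of a job, for instance along these lines (the directory layout below /scratch is an illustrative assumption):

    # stage input data from the Emmy WORK file system to the flash-based WORK
    rsync -a /scratch-emmy/usr/$USER/my_dataset/ /scratch/usr/$USER/my_dataset/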
The default CUDA version is 12.0, and the NVIDIA HPC SDK 23.3 is available via the nvhpc/23.3, nvhpc-byo-compiler/23.3, nvhpc-hpcx/23.3, and nvhpc-nompi/23.3 modules.
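As a sketch of a typical build step (the source file names are placeholders; the SDK bundles nvcc as well as the nvc/nvc++/nvfortran compilers):

    module load nvhpc/23.3                 # NVIDIA HPC SDK 23.3
    nvcc -arch=sm_80 -o saxpy saxpy.cu     # CUDA C++ for the A100 (compute capability 8.0)
    nvc -acc -gpu=cc80 -o jacobi jacobi.c  # OpenACC offloading with the NVHPC C compiler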
CUDA-enabled OpenMPI is available in the form of the HPC-X Toolkit (nvhpc-hpcx/23.3) and the NVIDIA/Mellanox OFED stack (openmpi-mofed/4.1.5a1). Please note that the previously installed OpenMPI versions will not provide CUDA support in combination with Infiniband.
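A CUDA-aware MPI build and launch could then look roughly as follows (node and task counts as well as the source file are placeholders):

    module load nvhpc-hpcx/23.3                       # OpenMPI from the HPC-X Toolkit, CUDA-enabled
    mpicc -o mpi_cuda_app mpi_cuda_app.c              # placeholder MPI+CUDA application
    srun --partition=grete --nodes=2 --gpus-per-node=4 --ntasks-per-node=4 ./mpi_cuda_app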
More information about using the new GPU system can be found in [1], and the accounting information has been extended to cover GPUs and MIG slices [2]. For example, in accordance with the recent round of compute time proposals, one full GPU node counts as the equivalent of 600 CPU cores.
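As a rough illustration (assuming the charge scales linearly with wallclock time), a 24-hour job occupying one full GPU node would thus be accounted as 600 × 24 = 14,400 core hours.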
Please do not hesitate to contact us if you have questions or need support migrating suitable applications to the GPU system.
The existing GPU nodes ggpu[01-03] with NVIDIA V100 32 GB GPUs will be migrated to the same site (“RZGö”) as “Grete” in mid-May. Operation will resume with the same Rocky Linux 8 based OS image as the new GPU nodes and an Infiniband interconnect, as part of the grete:shared, grete:preemptible, and grete:interactive partitions.
[1] https://www.hlrn.de/doc/display/PUB/GPU+Usage
[2] https://www.hlrn.de/doc/display/PUB/Accounting+in+Core+Hours