System Architecture

Compute Nodes

The TGI RAILS compute ecosystem is composed of three node types:

  1. Dual-socket CPU-only login nodes

  2. Dual-socket CPU-only compute nodes

  3. Dual-socket 8-way NVIDIA H100 GPU compute nodes

All nodes use Intel Sapphire Rapids CPUs, and all have hardware multithreading (Hyper-Threading) enabled.
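As a quick sanity check on any node, standard Linux tools report this topology directly (these are generic commands, not TGI RAILS-specific tooling):

```bash
# Report sockets, cores per socket, and threads per core;
# with hardware multithreading enabled, "Thread(s) per core" is 2.
lscpu | grep -E 'Socket|Core|Thread|Model name'

# Total hardware threads visible to the OS
nproc
```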

Table. Login Node Specifications

| Specification             | Value                                   |
|---------------------------|-----------------------------------------|
| Model                     | Dell PowerEdge R660                     |
| Number of nodes           | 2                                       |
| CPU                       | Intel Sapphire Rapids 6426Y (PCIe Gen5) |
| Sockets per node          | 2                                       |
| Cores per socket          | 16                                      |
| Cores per node            | 32                                      |
| Hardware threads per core | 2                                       |
| Hardware threads per node | 64                                      |
| Clock rate (GHz)          | ~2.50                                   |
| RAM (GB)                  | 256                                     |
| Cache L1/L2/L3            | 48 KB (per core) / 2 MB (per core) / 37.5 MB (shared) |

Table. CPU Compute Node Specifications

| Specification             | Value                                  |
|---------------------------|----------------------------------------|
| Model                     | Dell PowerEdge R760                    |
| Number of nodes           | 3                                      |
| CPU                       | Intel Sapphire Rapids 8468 (PCIe Gen5) |
| Sockets per node          | 2                                      |
| Cores per socket          | 48                                     |
| Cores per node            | 96                                     |
| Hardware threads per core | 2                                      |
| Hardware threads per node | 192                                    |
| Clock rate (GHz)          | ~2.10                                  |
| RAM (GB)                  | 512                                    |
| Cache L1/L2/L3            | 48 KB (per core) / 2 MB (per core) / 105 MB (shared) |
| Local storage (TB)        | 1.92                                   |

Table. 8-way NVIDIA H100 GPU Large Memory Compute Node Specifications

| Specification             | Value                       |
|---------------------------|-----------------------------|
| Model                     | Dell XE9680                 |
| Number of nodes           | 3                           |
| GPU                       | NVIDIA H100                 |
| GPUs per node             | 8                           |
| GPU memory (GB per GPU)   | 80                          |
| CPU                       | Intel Sapphire Rapids 8468  |
| CPU sockets per node      | 2                           |
| Cores per socket          | 48                          |
| Cores per node            | 96                          |
| Hardware threads per core | 2                           |
| Hardware threads per node | 192                         |
| Clock rate (GHz)          | ~2.10                       |
| RAM (GB)                  | 2,048                       |
| Cache L1/L2/L3            | 48 KB (per core) / 2 MB (per core) / 105 MB (shared) |
| Local storage (TB)        | 3.84                        |
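On a GPU node, the eight H100s can be enumerated with NVIDIA's standard utility; a minimal check, assuming the NVIDIA driver stack is available in your environment:

```bash
# List each GPU's index, name, and memory; expect eight H100 entries
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```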

Login Nodes

Login nodes provide interactive support for code editing and compilation.

Specialized Nodes

TGI RAILS will support data transfer nodes (serving the “TGI RAILS” Globus collection) as well as nodes supporting other services.

Network

TGI RAILS is connected to the NPCF core router and exit infrastructure via two 100 Gbps connections. NCSA’s 400+ Gbps of WAN connectivity carries traffic to and from users over optimal peering.

TGI RAILS resources are interconnected with 100 Gbps Ethernet.

File Systems

Storage is provided by a VAST 1x1 system with 330 TB of flash storage.

$WORK and $SCRATCH

A `module reset` in a job script will populate the $WORK environment variable automatically; alternatively, you may set it yourself with `WORK=/projects/<account>/$USER`. Both approaches are shown in the sketch below.
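A minimal job-script preamble illustrating both options; the script body is illustrative, and `<account>` is a placeholder for your project account:

```bash
#!/bin/bash
# Illustrative job-script preamble (scheduler directives omitted).

# Option 1: let the module system populate $WORK automatically
module reset

# Option 2: set it explicitly; replace <account> with your project account
export WORK=/projects/<account>/$USER

cd "$WORK"
```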


| File System      | Quota | Snapshots | Purged | Key Features |
|------------------|-------|-----------|--------|--------------|
| HOME (/u)        | 5 TB  | No/TBA    | No     | Area for software, scripts, job files, etc. NOT intended as a source/destination for I/O during jobs. |
| WORK (/projects) | 50 TB; up to 1-25 TB by allocation | No/TBA | No | Area for shared project data: common data sets, software, results, etc. |
| /tmp             | 1.92 TB (CPU) or 3.84 TB (GPU); shared or dedicated depending on node usage by job(s); no quotas in place | No | After each job | Locally attached disk for fast small-file I/O. |
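Because /tmp is node-local and purged after each job, a common pattern is to stage inputs in and copy results out within the job itself. A minimal sketch, where `my_app`, `input.dat`, and `output.dat` are hypothetical placeholders:

```bash
#!/bin/bash
# Hypothetical staging pattern for node-local /tmp (purged after each job).

JOB_TMP=/tmp/$USER.$$              # per-job scratch directory on local disk
mkdir -p "$JOB_TMP"

cp "$WORK/input.dat" "$JOB_TMP/"   # stage small-file-heavy input to fast local disk
( cd "$JOB_TMP" && ./my_app input.dat > output.dat )

cp "$JOB_TMP/output.dat" "$WORK/"  # copy results back before the job ends
rm -rf "$JOB_TMP"
```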

Quota Usage

The storage system does enforce quotas. Details on querying quotas and storage usage will be posted soon.
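Until then, standard utilities can at least report space consumed (these are generic Linux tools, not the storage system's quota interface); `<account>` is a placeholder for your project account:

```bash
# Approximate usage of your home and project areas
du -sh "$HOME"
du -sh /projects/<account>/$USER
```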