System Architecture

Compute Nodes

TGI RAILS is made up of a total of 8 nodes of three node types:

  • 2 Dual-socket CPU-only login nodes

  • 3 Dual-socket CPU-only compute nodes

  • 3 Dual-socket 8-way NVIDIA H100 GPU compute nodes

All processors are Intel Sapphire Rapids CPUs and all have hardware multithreading turned on.

Login Node Specifications

Login nodes provide interactive support for code editing, compilation and job submission. Login nodes do not contain GPUs and are not intended for computationally intensive workloads. See our login node policy for more information.

Specification

Value

Model

Dell PowerEdge R660

Number of nodes

2

CPU

Intel Sapphire Rapids 6426Y (PCIe Gen5)

Sockets per node

2

Cores per socket

16

Cores per node

32

Hardware threads per core

2

Hardware threads per node

64

Clock rate (GHz)

~ 2.50

RAM (GB)

256

Cache L1/L2/L3

48KB / 2MB / 37.5MB

CPU Compute Node Specifications

Specification

Value

Model

Dell PowerEdge R760

Number of nodes

3

CPU

Intel Sapphire Rapids 8468 (PCIe Gen5)

Sockets per node

2

Cores per socket

48

Cores per node

96

Hardware threads per core

2

Hardware threads per node

192

Clock rate (GHz)

~ 2.10

RAM (GB)

512

Cache L1/L2/L3

48KB (p/core) / 2MB (p/core) / 105MB (shared)

Local storage (TB)

1.92 TB

GPU Compute Node Specifications

Specification

Value

Model

Dell XE9680

Number of nodes

3

GPU

NVIDIA H100 (Vendor page)

GPUs per node

8

GPU Memory (GB)

80

CPU

Intel Sapphire Rapids 8468

CPU sockets per node

2

Cores per socket

48

Cores per node

96

Hardware threads per core

2

Hardware threads per node

192

Clock rate (GHz)

~ 2.10

RAM (GB)

2,048

Cache L1/L2/L3

48KB(p/core)/ 2MB(p/core)/ 105MB(shared)

Local storage (TB)

3.84 TB

Network

TGI RAILS is connected to the NPCF core router & exit infrastructure via two 100Gbps connections, NCSA’s 400Gbps+ of WAN connectivity carry traffic to/from users on an optimal peering.

TGI-RAILS resources are inter-connected with 100Gbps Ethernet.

Storage (File Systems)

RAILS storage is powered by the VAST storage system, an all-flash unified storage solution that provides a total raw capacity of 560 TB. With data-aware file compression, the effective capacity of the VAST system is augmented to approximately 1.7 PB. It boasts impressive performance capabilities, delivering 37 GB/s read and 6 GB/s write speeds, with 200,000 IOPS, ensuring efficient and rapid access to stored data. This system includes two primary file systems: Home and Projects which share the same storage capacity.

File System

Total Capacity

Default User Quota

Purged

Description

HOME (/u)

560 TB Raw, ~1.7 PB accessible via VAST compression.

18.5 TB

Never

User home directory, Area for software, scripts, job files, etc.

WORK (/projects)

560 TB Raw, ~1.7 PB accessible via VAST compression.

37.185 TB

Never

Area for shared data for a project, common data sets, software, results, etc.

/tmp

1.92 TB CPU Node, 3.84 TB GPU Node

None

After each job

Locally attached disk for fast small file IO.