The Lenovo ThinkSystem SD665-N V3 server tray is designed for High Performance Computing (HPC), large-scale cloud, heavy simulations, and modeling. It implements Lenovo Neptune™ Direct Water Cooling (DWC) technology to optimally support workloads from technical computing, grid deployments, and analytics, and is ideally suited for fields such as research, life sciences, energy, simulation, and engineering.
The unique design of ThinkSystem SD665-N V3 provides the optimal balance of serviceability, performance, and efficiency. By using a standard rack with the ThinkSystem DW612S enclosure equipped with patented stainless steel drip-less quick connectors, the SD665-N V3 provides easy serviceability and extreme density that is well suited for clusters ranging from small enterprises to the world's largest supercomputers.
Lenovo Neptune™ direct liquid cooling does not rely on risky plastic retrofitting; instead, it uses custom-designed copper water loops, so you have peace of mind implementing a platform with liquid cooling at the core of the design.
Compared to other technology, the SD665-N V3 direct water cooling:
- Reduces data center energy costs by up to 40%
- Increases system performance by up to 10%
- Delivers up to 100% heat removal efficiency (depending on the environment)
- Creates a quieter data center with its fan-less design
- Enables data center growth without adding computer room air conditioning
Lenovo’s direct water-cooled solutions are factory-integrated and re-tested at the rack level to ensure that a rack can be directly deployed at the customer site. This careful and consistent quality testing has been developed over more than a decade of experience designing and deploying DWC solutions to the very highest standards.
Scalability and performance
The ThinkSystem SD665-N V3 server tray and DW612S enclosure offer the following features to boost performance, improve scalability, and reduce costs:
- Each SD665-N V3 node supports two fourth-generation AMD EPYC processors, four NVIDIA H100 SXM GPUs, 24x TruDDR5 DIMMs, two OSFP 800G cages for high-speed I/O, and up to two drive bays, all in a 1U form factor.
- Up to 6x SD665-N V3 nodes are installed in the DW612S enclosure, occupying only 6U of rack space. It is a highly dense, scalable, and price-optimized offering.
- Supports two fourth-generation AMD EPYC 9004 processors
  - Up to 128 cores and 256 threads
  - Core speed of up to 4.1 GHz
  - Nominal TDP rating of up to 360 W, configurable TDP up to 400 W
- Supports four NVIDIA H100 GPUs
  - 700W SXM5 GPUs with configurable EDP (Electrical Design Point)
  - 80GB HBM3 or 94GB HBM2e GPU memory per GPU
  - Interconnected using dual NVLink 4.0 connections
  - Up to 400 Gb/s NDR connectivity to each GPU through four NVIDIA ConnectX-7 embedded network controllers
- Support for DDR5 memory DIMMs to maximize the performance of the memory subsystem:
  - Up to 24 DDR5 memory DIMMs, 12 DIMMs per processor
  - 12 memory channels per processor (1 DIMM per channel)
  - DIMM speeds up to 4800 MHz
  - Using 128GB 3DS RDIMMs, the node supports up to 3TB of system memory
- Supports high-speed GPU Direct networking with dual InfiniBand NDRx2 800Gb/s connections
  - Choice of two OSFP-DD or alternatively OSFP ports
  - Each port supports OSFP 800G (2x400 Gb/s) or OSFP 400G (400 Gb/s) connectivity
  - Direct connections to the GPUs: each OSFP port connects to two GPUs
- Supports up to two NVMe SSDs, as follows:
  - Two E3.S 1T NVMe SSDs
  - Two 7mm NVMe SSDs
  - One 15mm NVMe SSD
- Drives are NVMe to maximize I/O performance in terms of throughput, bandwidth, and latency.
- Supports a PCIe 4.0 x4 high-speed M.2 NVMe drive installed in an adapter for convenient operating system boot and internal storage functions.
- The node includes one Gigabit and two 25 Gb Ethernet onboard ports for cost-effective networking.
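The headline figures in the list above follow from simple multiplication. The following is a minimal Python sketch that reproduces them from the per-component numbers quoted in this section; the constant and function names are illustrative and are not part of any Lenovo tool.

```python
# Figures quoted in the feature list above (per SD665-N V3 node unless noted).
PROCESSORS_PER_NODE = 2
DIMMS_PER_PROCESSOR = 12   # 12 memory channels, 1 DIMM per channel
DIMM_CAPACITY_GB = 128     # 128GB 3DS RDIMM
GPUS_PER_NODE = 4
OSFP_PORT_GBPS = 800       # OSFP 800G (2x400 Gb/s)
GPUS_PER_OSFP_PORT = 2     # each OSFP port connects to two GPUs
NODES_PER_ENCLOSURE = 6    # DW612S enclosure, 6U


def node_memory_gb() -> int:
    """Maximum system memory per node, in GB."""
    return PROCESSORS_PER_NODE * DIMMS_PER_PROCESSOR * DIMM_CAPACITY_GB


def network_gbps_per_gpu() -> float:
    """Network bandwidth available to each GPU with OSFP 800G ports."""
    return OSFP_PORT_GBPS / GPUS_PER_OSFP_PORT


def gpus_per_enclosure() -> int:
    """GPU count in a fully populated DW612S enclosure."""
    return NODES_PER_ENCLOSURE * GPUS_PER_NODE


if __name__ == "__main__":
    mem = node_memory_gb()
    print(f"Memory per node: {mem} GB ({mem // 1024} TB)")
    print(f"Network bandwidth per GPU: {network_gbps_per_gpu():.0f} Gb/s")
    print(f"GPUs per 6U enclosure: {gpus_per_enclosure()}")
```

The results match the 3TB memory maximum and the 400 Gb/s per-GPU connectivity stated above.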
Energy efficiency
The direct water cooled solution offers the following energy efficiency features to save energy, reduce operational costs, increase energy availability, and contribute to a green environment:
- Water cooling eliminates power that is drawn by cooling fans in the enclosure and dramatically reduces the required air movement in the server room, which also saves power. In combination with an Energy Aware Runtime environment, savings of as much as 40% are possible in the data center due to the reduced need for air conditioning.
- Water chillers may not be required with a direct water cooled solution. Chillers are a major expense for most geographies and can be reduced or even eliminated because the water temperature can now be 45°C instead of 18°C in an air-cooled environment.
- Up to 100% heat recovery is possible with the direct water cooled design, depending on water temperature chosen. Heat energy absorbed may be reused for heating buildings in the winter, or generating cold through Adsorption Chillers, for further operating expense savings.
- The processors and other microelectronics are run at lower temperatures because they are water cooled, which uses less power, and allows for higher performance through Turbo Mode.
- The processors and accelerators are run at uniform temperatures because they are cooled in parallel loops, which avoids thermal jitter and provides higher and more reliable performance at the same power.
- Low-voltage 1.1V DDR5 memory offers energy savings compared to 1.2V DDR4 DIMMs: an approximately 20% decrease in power consumption.
- 80 Plus Titanium power supplies ensure energy efficiency.
- There are power monitoring and management capabilities through the System Management Module in the DW612S enclosure.
- A Lenovo power/energy meter based on the TI INA226 measures DC power for the CPU and the GPU board with better than 97% accuracy and reports it to the XCC at a 100 Hz sampling frequency. The readings can be accessed both in-band and out-of-band using IPMI raw commands.
- Optional Lenovo XClarity Energy Manager provides advanced data center power notification, analysis, and policy-based management to help achieve lower heat output and reduced cooling needs.
- Optional Energy Aware Runtime provides sophisticated power monitoring and energy optimization at the job level during application runtime without negatively impacting performance.
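To illustrate how 100 Hz power samples such as those from the power/energy meter could be turned into a per-job energy figure, here is a minimal Python sketch. It assumes the samples have already been collected as a list of watt values; how they are retrieved (the specific IPMI raw commands) is not covered in this guide and is not shown. The function names and the sample values are hypothetical.

```python
from typing import Sequence

SAMPLE_RATE_HZ = 100.0  # sampling frequency quoted for the power meter


def energy_joules(power_watts: Sequence[float],
                  sample_rate_hz: float = SAMPLE_RATE_HZ) -> float:
    """Approximate energy by the rectangle rule.

    Each sample is treated as the average power over one sampling
    interval of 1 / sample_rate_hz seconds.
    """
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    return sum(power_watts) / sample_rate_hz


def joules_to_kwh(joules: float) -> float:
    """Convert joules to kilowatt-hours (1 kWh = 3.6e6 J)."""
    return joules / 3.6e6


if __name__ == "__main__":
    # Hypothetical trace: one second of samples at a constant 500 W.
    trace = [500.0] * 100
    joules = energy_joules(trace)
    print(f"{joules:.1f} J = {joules_to_kwh(joules):.6f} kWh")
```

A rectangle rule is used for simplicity; a trapezoidal rule would be an equally reasonable choice for smoothly varying power.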
Manageability and security
The following powerful systems management features simplify local and remote management of the SD665-N V3 server:
- The server includes an XClarity Controller 2 (XCC2) to monitor server availability. An optional upgrade to XCC Platinum provides remote control (keyboard, video, mouse) functions, support for mounting remote media files, FIPS 140-3 security, enhanced NIST 800-193 support, boot capture, power capping, and other management and security features.
- Support for industry-standard management protocols: IPMI 2.0, SNMP 3.0, Redfish REST API, and serial console via IPMI.
- Integrated Trusted Platform Module (TPM) 2.0 support enables advanced cryptographic functionality, such as digital signatures and remote attestation.
- Supports AMD Secure Root-of-Trust, Secure Run and Secure Move features to minimize potential attacks and protect data as the OS is booted, as applications are run and as applications are migrated from server to server.
- Supports Secure Boot to ensure only a digitally signed operating system can be used.
- Industry-standard Advanced Encryption Standard (AES) NI support for faster, stronger encryption.
- With the System Management Module (SMM) installed in the enclosure, only one Ethernet connection is needed to provide remote systems management functions for all SD665-N V3 servers and the enclosure.
- The SMM management module has two Ethernet ports, which allow a single Ethernet connection to be daisy chained across 7 enclosures and 84 servers, thereby significantly reducing the number of Ethernet switch ports needed to manage an entire rack of SD665-N V3 servers and DW612S enclosures.
- The DW612S enclosure includes drip sensors that monitor the inlet and outlet manifold quick connect couplers; leaks are reported via the SMM.
- The server supports Lenovo XClarity suite software with Lenovo XClarity Administrator, Lenovo XClarity Provisioning Manager, and XClarity Energy Manager. They are described further in the Software section of this product guide.
- The Lenovo HPC & AI Software Stack provides HPC customers with a fully tested and supported open-source software stack that enables administrators and users to make the most effective and environmentally sustainable use of Lenovo supercomputing capabilities.
- Our Confluent management system and Lenovo Intelligent Computing Orchestration (LiCO) web portal provide an interface designed to abstract users from the complexity of HPC cluster orchestration and AI workload management, making open-source HPC software consumable for every customer.
- The LiCO web portal provides workflows for both AI and HPC, and supports multiple AI frameworks, allowing you to leverage a single cluster for diverse workload requirements.
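As an illustration of the Redfish REST API support listed above, the following Python sketch extracts the system resource links from a Redfish systems collection document. In the DMTF Redfish standard, the collection is served at /redfish/v1/Systems and lists its entries under "Members" as "@odata.id" links. The sample payload and the function name are hypothetical; the exact resources exposed by the XCC2 may differ, and retrieving the document over HTTPS with credentials is not shown.

```python
import json
from typing import List


def system_uris(collection_json: str) -> List[str]:
    """Return the @odata.id link of every member in a Redfish collection."""
    doc = json.loads(collection_json)
    return [m["@odata.id"] for m in doc.get("Members", []) if "@odata.id" in m]


# Hypothetical response body for GET /redfish/v1/Systems (illustration only).
SAMPLE_COLLECTION = """
{
  "@odata.id": "/redfish/v1/Systems",
  "Name": "Computer System Collection",
  "Members@odata.count": 1,
  "Members": [
    {"@odata.id": "/redfish/v1/Systems/1"}
  ]
}
"""

if __name__ == "__main__":
    for uri in system_uris(SAMPLE_COLLECTION):
        print(uri)
```

Each returned link can then be requested in turn to read properties of the individual system resource.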
Availability and serviceability
The SD665-N V3 node and DW612S enclosure provide the following features to simplify serviceability and increase system uptime:
- Designed to run 24 hours a day, 7 days a week
- Depending on the configuration and node population, the DW612S enclosure supports N+1 power policies for its power supplies, which means greater system uptime.
- All supported power supplies are hot-swappable, including the water-cooled power supplies.
- Toolless cover removal on the trays provides easy access to upgrades and serviceable parts, such as adapters and memory.
- The server uses ECC memory and supports memory RAS features including Single Device Data Correction (SDDC, also known as Chipkill), Patrol/Demand Scrubbing, Bounded Fault, DRAM Address Command Parity with Replay, DRAM Uncorrected ECC Error Retry, On-die ECC, ECC Error Check and Scrub (ECS), and Post Package Repair.
- Proactive Platform Alerts (including PFA and SMART alerts): Processors, voltage regulators, memory, internal storage (HDDs and SSDs, NVMe SSDs, M.2 storage), fans, power supplies, and server ambient and subcomponent temperatures. Alerts can be surfaced through the XClarity Controller to managers such as Lenovo XClarity Administrator and other standards-based management applications. These proactive alerts let you take appropriate actions in advance of possible failure, thereby increasing server uptime and application availability.
- The XCC offers optional remote management capability and can enable remote keyboard, video, and mouse (KVM) control and remote media for the node.
- Built-in diagnostics in UEFI, using Lenovo XClarity Provisioning Manager, speed up troubleshooting tasks to reduce service time.
- Lenovo XClarity Provisioning Manager supports diagnostics and can save service data to a USB key drive or remote CIFS share folder to aid troubleshooting and reduce service time.
- Auto restart in the event of a momentary loss of AC power (based on power policy setting in the XClarity Controller service processor)
- Virtual reseat is a supported feature of the System Management Module (SMM2). From a remote location, it simulates physically removing the node from AC power and reconnecting it.
- There is a three-year customer replaceable unit and onsite limited warranty, with next business day 9x5 coverage. Optional warranty upgrades and extensions are available.
- With water cooling, system fans are not required. This results in significantly reduced noise levels on the data center floor, a clear benefit to personnel who work on site.
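The N+1 power policy mentioned above means that the enclosure can lose one power supply and still carry its load. A minimal Python sketch of that check follows. The power supply count, capacity, and load values are purely illustrative and are not DW612S specifications, and the function name is hypothetical.

```python
def supports_n_plus_1(psu_count: int, psu_capacity_w: float, load_w: float) -> bool:
    """True if the load can still be carried after one power supply fails."""
    if psu_count < 2:
        # With a single supply there is no spare to fall back on.
        return False
    return load_w <= (psu_count - 1) * psu_capacity_w


if __name__ == "__main__":
    # Illustrative values only: six supplies of 2600 W each.
    for load in (12000.0, 14000.0):
        ok = supports_n_plus_1(psu_count=6, psu_capacity_w=2600.0, load_w=load)
        print(f"load {load:.0f} W -> N+1 possible: {ok}")
```

This is why N+1 support depends on the configuration and node population: a heavier load leaves less headroom for a spare supply.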