The Cray X1 and Supercomputer Roadmap
David Tanqueray
[email protected]
Copyright 2002, Cray Inc.
13th Daresbury Machine Evaluation Workshop, 11-12 December 2002
Sustained Gigaflop
• Achieved in 1988 on a Cray Y-MP
• Static finite element analysis
  – Phong Vu, Cray Research
  – Horst Simon, NASA Ames
  – Cleve Ashcraft, Yale University
  – Roger Grimes, Boeing Computer Services
  – John Lewis, Boeing Computer Services
  – Barry Peyton, Oak Ridge National Laboratory
• 1 Gigaflop/sec on a Cray Y-MP with 8 processors
• 1988 Gordon Bell prize winner
Sustained Teraflop Goal
• In the early 1990s Cray Research announced a three-stage program to achieve a sustained teraflop on a real application by the end of the decade
  – Cray T3D, Cray T3E (the third machine was never built due to the SGI acquisition)
• Achieved in 1998 on a T3E
• LSMS: locally self-consistent multiple scattering method
• Metallic magnetism simulation for understanding thin-film disc drive read heads and the magnets used in motors and power generation
• 1.02 Teraflops on a Cray T3E-1200E with 1480 processors
• 1998 Gordon Bell prize winner
B. Ujfalussy, Xindong Wang, Xiaoguang Zhang, D. M. C. Nicholson, W. A. Shelton, and G. M. Stocks, Oak Ridge National Laboratory; A. Canning, NERSC, Lawrence Berkeley National Laboratory; Yang Wang, Pittsburgh Supercomputing Center; B. L. Gyorffy, H. H. Wills Physics Laboratory, University of Bristol.
Sustained Petaflop Goal
• Cray is now announcing a three-stage program to achieve a sustained petaflop on a variety of applications by the end of the decade
  – Cray X1 is the 1st stage in this program
  – Will reach a peak petaflop with the 2nd-generation system
  – Will sustain a petaflop either with that system or its follow-on
  – Research for the 3rd-generation system is ongoing under the DARPA HPCS program
    • targeting robust, reliable, easy-to-use, trans-petaflop systems in the 2010 timeframe
Cray Supercomputer Evolution and Roadmap
[Roadmap diagram. High-bandwidth vector line (1-32 processors): Cray-1 → X-MP → Y-MP → C90 → T90, J90 → SV1 → SV1ex. Microprocessor-based MPP line (1000's of processors, UNICOS/mk): T3D → T3E → T3E-1200 → T3E-1350. The two lines converge in the high-bandwidth MPP line: X1 → X1e → BlackWidow (BW+) → BW++, targeting sustained petaflops under DARPA HPCS.]
Architecture
Example: US DOE ASCI Applications
Hydro
• Explicit
• Robust numerical algorithms
• Local, cache-friendly, high computational intensity
• Easy to parallelize and load balance
Radiation
• Implicit
• Numerical algorithms not robust
• Local/global gather/scatter, latency sensitive
• Hard to parallelize
• Dominates runtime
[Charts: NASPB Block Tridiagonal and NASPB Conjugate Gradient performance, T90 vs. T3E, plotted against number of processors (0-280).]
Cray X1
[Diagram: the X1 combines two lineages.]
• From the Cray PVP line:
  – Powerful single processors
  – Very high memory bandwidth
  – Non-unit stride computation
  – Special ISA features
• From the Cray T3E line:
  – Distributed shared memory
  – High-BW scalable network
  – Optimized communication and synchronization features
• Modernized the ISA and microarchitecture
• Improved via custom processors
Not another classic vector machine
• New instruction set
• New system architecture
• New processor microarchitecture
• “Classic” vector machines were programmed differently
  – Classic vector: optimize for loop length with little regard for locality
  – Scalable micro: optimize for locality with little regard for loop length
• The Cray X1 is programmed like a parallel micro-based machine (see the sketch below)
  – Rewards locality: register, cache, local memory, remote memory
  – Decoupled microarchitecture doesn’t mind short loop lengths
  – Likes loop-level parallelism, as do newer scalar processors
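To make the locality point concrete, here is a minimal C sketch (mine, not from the slides; the block size B is a hypothetical tuning parameter) of the style the X1 rewards: blocking for cache/local-memory reuse, as on a micro-based machine, while leaving short inner vector loops that the decoupled hardware handles well.

#include <stddef.h>

#define B 64  /* hypothetical block size, chosen for cache reuse */

/* Blocked matrix-vector accumulate y += A*x. Blocking over j reuses
 * a B-element slice of x, improving locality; each inner loop then
 * has a short, fixed trip count -- acceptable on the X1, unlike on
 * a classic vector machine tuned for long loops. */
void matvec_blocked(size_t n, const double *a, const double *x, double *y)
{
    for (size_t jj = 0; jj < n; jj += B) {
        size_t jend = (jj + B < n) ? jj + B : n;
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t j = jj; j < jend; j++)  /* short vector loop */
                sum += a[i * n + j] * x[j];
            y[i] += sum;
        }
    }
}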
Cray X1 Instruction Set Architecture
Enhanced ISA (based on a decade of research)
  – Much larger register set
  – All operations performed under mask (see the sketch below)
  – 64- and 32-bit memory and IEEE arithmetic
  – New synchronization features for scalability
Inherent advantages of a vector ISA
  – Very high single-processor performance with low complexity:
      ops/sec = (cycles/sec) × (instrs/cycle) × (ops/instr)
  – Localized computation: large register sets with very regular access patterns; registers and functional units grouped into local clusters (pipes) ⇒ excellent fit with future IC technology
  – Latency tolerance and pipelining to memory/network ⇒ very well suited for scalable systems
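As an illustration of “all operations performed under mask”: in source code this usually appears as a conditional inside a vectorizable loop, which the compiler turns into masked vector operations instead of per-element branches. The sketch below is generic C, not Cray-specific syntax.

/* The comparison builds a vector mask; the divide and store are then
 * executed only in lanes where the mask is true, so the loop
 * vectorizes despite the conditional. */
void safe_reciprocal(int n, const double *x, double *r)
{
    for (int i = 0; i < n; i++) {
        if (x[i] != 0.0)        /* becomes the mask */
            r[i] = 1.0 / x[i];  /* performed under mask */
        else
            r[i] = 0.0;
    }
}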
Cray X1 Node
[Node diagram: sixteen processor (P) / cache ($) pairs above sixteen memory controller (M) and local memory (mem) sections, plus I/O ports; 51 Gflops and 200 GB/s per node.]
• Four multistream processors (MSPs), each 12.8 Gflops
• High-bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node
NUMA Scalable up to 1024 Nodes
[Diagram: nodes joined by the interconnection network.]
• 32 parallel networks for bandwidth
• Global shared memory across the machine
Designed for Scalability (T3E Heritage)
• Distributed shared memory (DSM) architecture
  – Low-latency load/store access to the entire machine (tens of TBs); see the one-sided sketch below
• Decoupled vector memory architecture for latency tolerance
  – Thousands of outstanding references, flexible addressing
• Very high-performance network
  – High bandwidth, fine-grained transfers
  – Same router as the Origin 3000, but 32 parallel copies of the network
• Architectural features for scalability
  – Remote address translation
  – Global coherence protocol optimized for distributed memory
  – Fast synchronization
• Parallel I/O scales with system size
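Programmers reach this global load/store capability through one-sided models such as SHMEM. A minimal sketch, written with the later-standardized OpenSHMEM C interface (an assumption on my part; X1-era Cray SHMEM used slightly different initialization calls):

#include <shmem.h>

long target;  /* symmetric variable: one instance exists on every PE */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* One-sided put: store directly into the neighbor's memory with
     * no matching receive -- a natural fit for hardware that allows
     * load/store access to the whole machine. */
    shmem_long_p(&target, (long)me, (me + 1) % npes);

    shmem_barrier_all();  /* fast hardware-assisted synchronization */
    shmem_finalize();
    return 0;
}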
Decoupled Microarchitecture
• Decoupled access/execute and decoupled scalar/vector
• Scalar unit runs ahead, doing addressing and control
  – Scalar and vector loads issued early
  – Store addresses computed and saved for later use
  – Operations queued and executed later when data arrives
• Hardware dynamically unrolls loops
  – Scalar unit goes on to issue the next loop before the current loop has completed
  – Full scalar register renaming through 8-deep shadow registers
  – Vector loads renamed through load buffers
  – Special sync operations keep the pipeline full, even across barriers
• This is key to making the system like short-VL code (see the kernel sketch below)
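A classic short-vector-length kernel makes the point: in a CSR sparse matrix-vector product (my example, not from the slides), each row's inner loop may run only a handful of iterations. A decoupled machine can issue the loads for row i+1 while the arithmetic for row i is still draining, effectively unrolling across loop instances in hardware.

/* CSR sparse matrix-vector product y = A*x. Each inner loop is a
 * short vector loop (one per row) containing a gather from x;
 * hardware loop unrolling overlaps successive rows. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)  /* short VL */
            sum += val[k] * x[col[k]];                   /* gather */
        y[i] = sum;
    }
}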
Cray SV2 I/O Infrastructure
[I/O-centric diagram of the Cray SV2 (X1) system: 16 SV2 nodes per cabinet, each node connected through I-chips to 1.2 GByte/sec SPC channels. Peak signal rates in each direction: 4.8 GBytes/sec per SV2 node, 75 GBytes/sec per SV2 cabinet.]
System Software
Cray X1 UNICOS/mp
[Diagram: a 256-CPU Cray X1 running as a Single System Image.]
• The UNICOS/mp UNIX kernel executes on each application node (somewhat like the Chorus™ microkernel in UNICOS/mk), providing SSI, scaling, and resiliency
• System service nodes provide file services, process management, and other basic UNIX functionality (like the /mk servers); user commands execute on the system service nodes
• A global resource manager (Psched) schedules applications onto Cray X1 nodes
• Commands launch applications as on the T3E (mpprun)
Programmer’s View
• Traditional shared memory applications (see the OpenMP sketch below)
  – OpenMP, pthreads
  – 4 MSPs (about 50 Gflops)
  – Single-node memory (16-32 GB)
  – Very high memory bandwidth
  – No data locality issues
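A minimal OpenMP sketch in C of this shared-memory style (illustrative only; X1 programmers would as often use Fortran and the compiler's multistreaming, neither shown here):

#include <omp.h>

/* Shared-memory triad across the MSPs of one node. No explicit data
 * placement is needed: the node's memory is uniformly accessible at
 * very high bandwidth, so a plain parallel loop suffices. */
void triad(int n, double *a, const double *b, const double *c,
           double scalar)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}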
• Distributed memory applications (see the MPI sketch below)
  – MPI, shmem(), UPC, Co-array Fortran
  – Same kinds of optimizations as on microprocessor-based machines
    • work and data decomposition
    • cache blocking
  – But less worry about communication/computation ratio, strides, and bandwidth
    • multiple GB/s of network bandwidth between nodes
    • scatter/gather and large-stride support
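And a hedged MPI sketch of the distributed-memory style: a 1-D block decomposition with a one-cell halo exchange, written exactly as one would for any micro-based MPP. On the X1 the same code simply benefits from the much higher network bandwidth.

#include <mpi.h>

/* One stencil sweep over a 1-D block decomposition. u[0] and
 * u[nlocal+1] are halo cells; u[1..nlocal] is the locally owned data. */
void stencil_step(int nlocal, double *u, double *unew, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange halos with both neighbors. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[nlocal + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[nlocal], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 comm, MPI_STATUS_IGNORE);

    /* 3-point smoothing stencil on the owned cells. */
    for (int i = 1; i <= nlocal; i++)
        unew[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];
}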
Mechanical Design
Packaging - node board, lower side
[Photo callouts: field-replaceable memory daughter cards, spray-cooling caps, CPU MCM (8 chips), network interconnect, air-cooling manifold; 17" × 22" PCB.]
Cray X1 Node Module
Cray X1 Mainframe, Undressed
64 Processor Cray X1 System ~820 Gflops
256 Processor Cray X1 System ~ 3.3 Tflops
Cray X1 Compute Density Comparison

                                  Cray X1        NEC SX-6       IBM p690 with SP Switch
Peak GFLOPS per processor         12.8           8.0            5.2
Processors to equal 800 GFLOPS    64             100            156
Cabinets                          1              13             5
Floor space                       32 sq ft       169 sq ft      64 sq ft
Compute density                   25 Gflops/ft²  2 Gflops/ft²   13 Gflops/ft²
Compute efficiency                13 Gflops/kW   8 Gflops/kW    5.2 Gflops/kW
Cray X1 Summary
• Extreme system capability
  – Tens of TFLOPS in a Single System Image (SSI)
  – Efficient operation due to balanced system design
  – Focus on sustained performance on the most challenging problems
• Achieved via:
  – Very powerful single processors
    • high instruction-level parallelism
    • high-bandwidth memory system
  – Best-in-the-world scalability
    • latency tolerance via decoupled vector processors
    • very high-performance, tightly integrated network
    • scalable address translation and communication protocols
    • robust, highly scalable operating system (UNICOS/mp)
Questions?