The Cray X1 and Supercomputer Roadmap

David Tanqueray [email protected]

Copyright 2002, Cray Inc.

13th Daresbury Machine Evaluation Workshop 11-12th December 2002

Sustained Gigaflop
• Achieved in 1988 on a Cray Y-MP
• Static finite element analysis
  – Phong Vu, Cray Research
  – Horst Simon, NASA Ames
  – Cleve Ashcraft, Yale University
  – Roger Grimes, Boeing Computer Services
  – John Lewis, Boeing Computer Services
  – Barry Peyton, Oak Ridge National Laboratory
• 1 Gigaflop/sec on a Cray Y-MP with 8 processors
• 1988 Gordon Bell prize winner


Sustained Teraflop Goal
• In the early 1990s Cray Research announced a three-stage program to achieve a sustained teraflop on a real application by the end of the decade
  – Cray T3D, Cray T3E (the third machine was never built due to the SGI acquisition)
• Achieved in 1998 on a T3E
• LSMS: locally self-consistent multiple scattering method
• Metallic magnetism simulation for understanding thin-film disc drive read heads and the magnets used in motors and power generation
• 1.02 Teraflops on a Cray T3E-1200E with 1480 processors
• 1998 Gordon Bell prize winner


B. Ujfalussy, Xindong Wang, Xiaoguang Zhang, D. M. C. Nicholson, W. A. Shelton, and G. M. Stocks, Oak Ridge National Laboratory; A. Canning, NERSC, Lawrence Berkeley National Laboratory; Yang Wang, Pittsburgh Supercomputing Center; B. L. Gyorffy, H. H. Wills Physics Laboratory, University of Bristol.


Sustained Petaflop Goal
• Cray is now announcing a three-stage program to achieve a sustained petaflop on a variety of applications by the end of the decade
  – Cray X1 is the 1st stage in this program
  – Will reach a peak petaflop with the 2nd-generation system
  – Will sustain a petaflop either with that system or its follow-on
  – Research for the 3rd-generation system is ongoing under the DARPA HPCS program
    • targeting robust, reliable, easy-to-use, trans-petaflop systems in the 2010 timeframe


Cray Supercomputer Evolution and Roadmap

[Roadmap diagram. The product lines converge on the X1 family:
– High-bandwidth vector, 1-32 processors: Cray-1 → X-MP → Y-MP → C90 → T90/J90 → SV1 → SV1ex
– Microprocessor-based MPP, 1000s of processors, UNICOS/mk: T3D → T3E (1200, 1350)
– High-bandwidth MPP: X1 → X1e → BlackWidow (BW+) → BW++, targeting sustained petaflops under the DARPA HPCS program]

Architecture


Example: US DOE ASCI Applications

Hydro
• Explicit
• Robust numerical algorithms
• Local, cache-friendly, high computational intensity
• Easy to parallelize and load balance

Radiation
• Implicit
• Numerical algorithms not robust
• Local/global gather/scatter, latency sensitive
• Hard to parallelize
• Dominates runtime

[Plots: NAS Parallel Benchmarks, Block Tridiagonal (BT) and Conjugate Gradient (CG), performance versus number of processors (0-280) for the Cray T90 and Cray T3E]

Cray X1: Cray PVP and Cray T3E heritage

Cray PVP
• Powerful single processors
• Very high memory bandwidth
• Non-unit stride computation
• Special ISA features
→ Modernized the ISA and microarchitecture

Cray T3E
• Distributed shared memory
• High BW scalable network
• Optimized communication and synchronization features
→ Improved via custom processors


Not another classic vector machine
• New instruction set
• New system architecture
• New processor microarchitecture
• "Classic" vector machines were programmed differently
  – Classic vector: optimize for loop length with little regard for locality
  – Scalable micro: optimize for locality with little regard for loop length
• The Cray X1 is programmed like a parallel micro-based machine (see the sketch below)
  – Rewards locality: register, cache, local memory, remote memory
  – Decoupled microarchitecture doesn't mind short loop lengths
  – Likes loop-level parallelism, as do newer scalar processors
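To make the contrast concrete, here is a minimal C sketch (illustrative only, not from the slides) of the two programming styles applied to a matrix multiply; the problem size N and the tile size BLK are hypothetical tuning parameters. The second version is the locality-oriented style that the X1, like cache-based microprocessors, rewards.

```c
#define N   2048
#define BLK 64   /* hypothetical tile size, assumed to divide N */

/* "Classic vector" style: long unit-stride inner loop (vector length N),
 * streaming whole rows of b and c with little attention to reuse. */
void matmul_vector(double (*c)[N], const double (*a)[N], const double (*b)[N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)              /* long vector loop */
                c[i][j] += a[i][k] * b[k][j];
}

/* "Locality" style: the same computation tiled so the BLK x BLK block of b
 * is reused for every row i, and a short segment of c[i][] stays resident
 * across the k loop; the inner loops are short (length BLK). */
void matmul_blocked(double (*c)[N], const double (*a)[N], const double (*b)[N]) {
    for (int jj = 0; jj < N; jj += BLK)
        for (int kk = 0; kk < N; kk += BLK)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < kk + BLK; k++)
                    for (int j = jj; j < jj + BLK; j++)   /* short vector loop */
                        c[i][j] += a[i][k] * b[k][j];
}
```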


Cray X1 Instruction Set Architecture

Enhanced ISA (based on a decade of research)
– Much larger register set
– All operations performed under mask
– 64- and 32-bit memory and IEEE arithmetic
– New synchronization features for scalability

Inherent advantages of a vector ISA
– Very high single-processor performance with low complexity:
  ops/sec = (cycles/sec) × (instrs/cycle) × (ops/instr)  (worked example below)
– Localized computation
  • large register sets with very regular access patterns
  • registers and functional units grouped into local clusters (pipes)
  ⇒ excellent fit with future IC technology
– Latency tolerance and pipelining to memory/network
  ⇒ very well suited for scalable systems
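A worked instance of this relation (my arithmetic, not on the slide; the 800 MHz clock and 16 flops per cycle per MSP are assumptions about the X1, with the last two factors folded into flops per cycle):

\[
800 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}} \times 16\ \tfrac{\text{flops}}{\text{cycle}}
  = 12.8 \times 10^{9}\ \tfrac{\text{flops}}{\text{s}}
  = 12.8\ \text{Gflops per MSP}
\]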


Cray X1 Node

[Node diagram: 16 single-stream processors (P), each with its cache ($), connected to 16 memory controller/memory sections (M/mem), plus node I/O]

51 Gflops, 200 GB/s per node (arithmetic check below)
• Four multistream processors (MSPs), each 12.8 Gflops
• High bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node
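A quick check of the node-level balance implied by these figures (my calculation, not on the slide):

\[
4 \times 12.8\ \text{Gflops} = 51.2\ \text{Gflops per node}, \qquad
\frac{200\ \text{GB/s}}{51.2\ \text{Gflops}} \approx 3.9\ \text{bytes of local memory bandwidth per flop}
\]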


NUMA: Scalable up to 1024 Nodes

Interconnection Network
• 32 parallel networks for bandwidth
• Global shared memory across the machine

Designed for Scalability: T3E Heritage
• Distributed shared memory (DSM) architecture
  – Low latency, load/store access to the entire machine (tens of TBs); a hedged sketch of this one-sided style follows this list
• Decoupled vector memory architecture for latency tolerance
  – Thousands of outstanding references, flexible addressing
• Very high performance network
  – High bandwidth, fine-grained transfers
  – Same router as the Origin 3000, but 32 parallel copies of the network
• Architectural features for scalability
  – Remote address translation
  – Global coherence protocol optimized for distributed memory
  – Fast synchronization
• Parallel I/O scales with system size
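The programming model this supports is one-sided, fine-grained remote access. Below is a minimal, illustrative C sketch using the SHMEM library (listed later under the programmer's view); the classic start_pes initialization and shmem_double_put call are standard SHMEM of that era, but treat the exact header path, array sizes, and values as assumptions rather than X1 specifics.

```c
/* Illustrative SHMEM sketch: each PE writes a small block directly into its
 * right-hand neighbour's memory with a one-sided put; no matching receive.
 * Header path follows the Cray/SGI convention. */
#include <mpp/shmem.h>
#include <stdio.h>

#define NELEMS 8

/* Static data is symmetric (same address on every PE), so it can be a remote target. */
static double inbox[NELEMS];

int main(void) {
    start_pes(0);                         /* classic SHMEM initialization */
    int me    = _my_pe();
    int npes  = _num_pes();
    int right = (me + 1) % npes;

    double outbox[NELEMS];
    for (int i = 0; i < NELEMS; i++)
        outbox[i] = me + 0.1 * i;

    /* One-sided transfer: deposit outbox into the neighbour's inbox. */
    shmem_double_put(inbox, outbox, NELEMS, right);
    shmem_barrier_all();                  /* ensure all puts are globally visible */

    printf("PE %d received %.1f ... from PE %d\n",
           me, inbox[0], (me - 1 + npes) % npes);
    return 0;
}
```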


Decoupled Microarchitecture
• Decoupled access/execute and decoupled scalar/vector
• Scalar unit runs ahead, doing addressing and control
  – Scalar and vector loads issued early
  – Store addresses computed and saved for later use
  – Operations queued and executed later when data arrives
• Hardware dynamically unrolls loops
  – Scalar unit goes on to issue the next loop before the current loop has completed
  – Full scalar register renaming through 8-deep shadow registers
  – Vector loads renamed through load buffers
  – Special sync operations keep the pipeline full, even across barriers

This is key to making the system perform well on short vector-length (short-VL) code, as in the sketch below.
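For concreteness, this is the shape of loop meant by short-VL code; a minimal, illustrative C sketch (the trip count of 8 and the kernel are invented for the example). On a classic vector machine the inner loop is too short to fill the pipes; here the decoupled scalar unit runs ahead and issues successive outer iterations, so the vector unit stays busy anyway.

```c
/* Illustrative short-VL kernel: many independent small blocks, each with a
 * fixed inner trip count of 8 (a short vector length). The decoupled scalar
 * unit can compute the next iteration's addresses and issue its vector loads
 * while the current iteration's vector work is still completing. */
void axpy_blocks(double *y, const double *x, const double *alpha, int nblocks) {
    for (int i = 0; i < nblocks; i++) {   /* outer loop: issued ahead by the scalar unit */
        double a = alpha[i];              /* scalar work per block */
        for (int j = 0; j < 8; j++)       /* short vector loop (VL = 8) */
            y[i * 8 + j] += a * x[i * 8 + j];
    }
}
```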


Cray SV2 I/O Infrastructure

[I/O-centric view of the Cray SV2 system: the 16 SV2 nodes of a cabinet (numbered 0-15), each connected through I-chips to 1.2 GByte/s SPC channels]

Peak signal rates, each direction:
• 4.8 GBytes/s per SV2 node
• 75 GBytes/s per SV2 cabinet
• 1.2 GByte/s per SPC channel


System Software


Cray X1 UNICOS/mp

Single System Image across a 256-CPU Cray X1:
• A UNIX kernel executes on each application node (somewhat like the Chorus™ microkernel on UNICOS/mk) and provides SSI, scaling and resiliency
• System Service nodes provide file services, process management and other basic UNIX functionality (like the /mk servers); user commands execute on the System Service nodes
• The global resource manager (Psched) schedules applications on Cray X1 nodes
• Commands launch applications as on the T3E (mpprun)

Programmer's View
• Traditional shared memory applications (a hedged OpenMP sketch follows this list)
  – OpenMP, pthreads
  – 4 MSPs (about 50 Gflops)
  – Single node memory (16-32 GB)
  – Very high memory bandwidth
  – No data locality issues
• Distributed memory applications
  – MPI, shmem(), UPC, Co-array Fortran
  – Same kinds of optimizations as on microprocessor-based machines
    • work and data decomposition
    • cache blocking
  – But less worry about communication/computation ratio, strides and bandwidth
    • multiple GB/s network bandwidth between nodes
    • scatter/gather and large-stride support
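As an example of the first, shared-memory style, here is a minimal OpenMP sketch in C (illustrative only; the array size and kernel are invented, not taken from the slides). Within one node the loop is simply parallelized across the processors with no explicit data placement, which is what "no data locality issues" amounts to in practice.

```c
/* Illustrative shared-memory style: parallelize a loop across the processors
 * of one node with OpenMP; no explicit data distribution is needed. */
#include <omp.h>
#include <stdlib.h>

#define N 1000000            /* invented problem size */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    double sum = 0.0;
    /* Each thread works on a chunk of the shared arrays; the node's shared
     * memory and high bandwidth serve all of them. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] += 0.5 * b[i];
        sum  += a[i];
    }

    free(a); free(b);
    return (sum > 0.0) ? 0 : 1;   /* keep the result live */
}
```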


Mechanical Design


Packaging: Node Board (Lower Side)

[Photo labels: field-replaceable memory daughter cards; spray cooling caps; CPU MCM (8 chips); network interconnect; 17" x 22" PCB; air cooling manifold]

Cray X1 Node Module


Cray X1 Mainframe, Undressed


64 Processor Cray X1 System ~820 Gflops


256 Processor Cray X1 System ~ 3.3 Tflops


Cray X1 Compute Density Comparison

                                    Cray X1          NEC SX-6         IBM p690 with SP Switch
Peak GFLOPS per processor           12.8             8.0              5.2
Processors to equal 800 GFLOPS      64               100              156
Cabinets                            1                13               5
Floor space                         32 sq. ft.       169 sq. ft.      64 sq. ft.
Compute density                     25 Gflops/ft²    2 Gflops/ft²     13 Gflops/ft²
Compute efficiency                  13 Gflops/kW     8 Gflops/kW      5.2 Gflops/kW


Cray X1 Summary
• Extreme system capability
  – Tens of TFLOPS in a Single System Image (SSI)
  – Efficient operation due to balanced system design
  – Focus on sustained performance on the most challenging problems
• Achieved via:
  – Very powerful single processors
    • High instruction-level parallelism
    • High bandwidth memory system
  – Best-in-the-world scalability
    • Latency tolerance via decoupled vector processors
    • Very high-performance, tightly integrated network
    • Scalable address translation and communication protocols
    • Robust, highly scalable operating system (UNICOS/mp)


Questions?
