
M7: ORACLE’S NEXT-GENERATION SPARC PROCESSOR

THE ORACLE SPARC M7 PROCESSOR, WITH 32 CORES INTEGRATED ON A SYSTEM ON CHIP SUPPORTING 256 THREADS, TRIPLES THE PREVIOUS GENERATION'S THROUGHPUT PERFORMANCE. THE M7 PROCESSOR ALSO OFFERS IMPROVED PER-THREAD PERFORMANCE, RAS CAPABILITIES, POWER EFFICIENCY, AND MORE THAN DOUBLE THE MEMORY AND I/O BANDWIDTHS. SOFTWARE-IN-SILICON FEATURES INCLUDE ACCELERATION OF SPECIFIC DATABASE OPERATIONS, POINTER VERSION CHECKING FOR ENHANCED SOFTWARE ROBUSTNESS, AND SUPPORT FOR FINE-GRAINED MEMORY MIGRATION.

Kathirgamar Aingaran, Sumti Jairath, Georgios Konstadinidis, Serena Leung, Paul Loewenstein, Curtis McAllister, Stephen Phillips, Zoran Radovic, Ram Sivaramakrishnan, David Smentek, and Thomas Wicki, Oracle

The Oracle Sparc M7 processor elevates commercial workloads to new levels of throughput.1,2 A new cache and memory hierarchy coupled with other improvements provides triple the processing speed of the previous generation. Multiprocessor scaling and virtualization optimizations enable efficient nonuniform memory access shared-memory systems from one to eight directly connected processors and up to 32 processors with additional directory and switch application-specific integrated circuits (ASICs). Power management improvements are also key to the increased in-system performance. Software-in-silicon features include database query and decompression acceleration, increased application software robustness, and support for robust coherent shared memory across a cluster of M7-based systems. The M7 processor is implemented in Taiwan Semiconductor Manufacturing Company's 20-nm process with a 13-metal-layer hierarchical stack to achieve the best tradeoff between speed and routing density.2 A modular approach facilitates physical design and verification closure.

S4 core, L2 caches, and core cluster

The M7 processor (see Figure 1) contains 32 Sparc cores, each supporting up to eight hardware threads. The fourth-generation chip-multithreaded (CMT) core is dual-issue and out-of-order, and it dynamically allocates internal resources among the active threads. M7 can trade per-thread performance for throughput by running more threads, or it can run fewer, higher-performance threads by devoting more resources to each thread. This design lets users determine the appropriate balance of overall throughput versus per-thread performance for their specific application. The 32 cores are grouped into eight Sparc core clusters (SCC), each containing four processing cores sharing a single 256-Kbyte level-2 (L2) instruction cache.





Each SCC has two 256-Kbyte L2 data caches, each shared by two of the four cores,2 as well as a shared level-3 (L3) cache partition. To increase the reliability, availability, and serviceability (RAS) of M7 systems, the core supports "core recovery," whereby a core's architectural state can be migrated to another core, allowing continued operation if a core is taken offline.

Figure 1. Die photograph with core cluster detail. Arrows indicate major interconnects.

L3 cache

The L3 cache is fully shared and partitioned.3 Each L3 partition is contained within an SCC. Any L3 partition may service a request; hot cache lines are migrated to the closest L3 cache partition, and cache lines can be replicated and victimized between partitions. The L3 cache supports the MOESI (modified, owned, exclusive, shared, invalid) interprocessor cache states. Each L3 cache partition is eight-way set-associative with a 64-byte line size and comprises two address-interleaved banks. The L3 cache pipeline is 13 cycles long, of which four cycles are spent accessing the tag array and generating a way select for accessing the data array. Data array access takes six cycles. The data array consists of address-interleaved subbanks and can sustain a throughput of one access every cycle.

The M7 cache hierarchy is strictly inclusive. The L3 cache maintains inclusion by tracking the L2 cache lines using a directory, which precisely tracks the L2 cache structure using information embedded in each L2 cache request. For handling cache misses, each L3 cache bank has a 32-entry miss buffer, which allows same-address requests from its core cluster to be chained together. All the requests in the chain are processed before an interprocessor snoop to the same address is serviced. The chain can be established before or after the interprocessor snoop arrives. Besides maintaining the interprocessor states, each L3 cache partition maintains a "Supplier" state. Only one partition can install a cache line of a given address in the Supplier state, so that only one partition supplies data in response to an on-chip broadcast request.
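As a concrete illustration of this geometry, the following C sketch decomposes a physical address into bank, set, and tag for one L3 partition. The 64-byte line, two banks per partition, and eight ways come from the text; the 4-Mbyte bank size (and hence 8,192 sets) is an assumption derived from the 64-Mbyte total L3 capacity cited in reference 2, and the real hardware may hash addresses differently.

#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64-Mbyte L3 / 8 partitions / 2 banks = 4 Mbytes per
 * bank; 8-way set-associative with 64-byte lines gives 8,192 sets. */
#define LINE_BYTES 64ull
#define BANKS      2ull                                /* per partition */
#define WAYS       8ull
#define BANK_BYTES (4ull * 1024 * 1024)                /* assumed */
#define SETS       (BANK_BYTES / (LINE_BYTES * WAYS))  /* 8,192 */

static void decompose(uint64_t pa) {
    uint64_t offset = pa % LINE_BYTES;                 /* bits [5:0] */
    uint64_t bank   = (pa / LINE_BYTES) % BANKS;       /* bit [6]    */
    uint64_t set    = (pa / (LINE_BYTES * BANKS)) % SETS;
    uint64_t tag    = pa / (LINE_BYTES * BANKS * SETS);
    printf("pa=%#llx bank=%llu set=%llu tag=%#llx offset=%llu\n",
           (unsigned long long)pa, (unsigned long long)bank,
           (unsigned long long)set, (unsigned long long)tag,
           (unsigned long long)offset);
}

int main(void) {
    decompose(0x12345678ull);
    return 0;
}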

On-chip network



A high-bandwidth, low-latency on-chip network (OCN) connects the eight L3 cache partitions to each other, to the four memory controller units (MCUs), and to the four I/O and coherence gateways, each of which handles distinct sets of addresses. The OCN comprises three physical networks: request, response, and data.2 As Figure 2 shows, the request network comprises four address-sliced rings, each with 11 agents corresponding to the eight L3 cache partitions, two MCUs, and one gateway. A requester on this multicast network identifies the target agents using a bit mask; an agent receives a request from the network only if it is identified in this bit mask. The request rings are unidirectional for protocol simplicity; multiple electrical and logic techniques ensure that a packet completes a circuit of the ring within 3 ns. The response network is a point-to-point network connecting all agents to each other. This network aggregates snoop responses from all L3 cache partitions before delivering a consolidated response to a requesting agent. The data network (Figure 2) consists of a mesh of six pruned 10 × 10 switches. Each L3 cache partition and gateway agent has a 16-byte OCN data port in each direction. Each MCU instance has two 16-byte OCN data ports per direction. The bisection bandwidth of the data network is more than 500 Gbytes per second (GBps).

Figure 2. On-chip network (OCN) request and data network topologies. The response network and non-OCN connections are omitted. Physically, the request network overlies the data network.

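A minimal C sketch of the request-ring multicast filter follows. The 11 agents per ring and the target bit mask come from the text; the struct layout, agent numbering, and field widths are illustrative assumptions, not the hardware encoding.

#include <stdint.h>
#include <stdbool.h>

/* One request ring has 11 agents: eight L3 partitions, two MCUs, and one
 * gateway.  A request names its targets with a bit mask. */
enum { NUM_AGENTS = 11 };

struct ocn_request {
    uint16_t target_mask;  /* bit i set => agent i must accept */
    uint64_t address;
};

/* An agent accepts a request only if its bit is set in the mask. */
static bool agent_accepts(const struct ocn_request *req, unsigned agent_id) {
    return (req->target_mask >> agent_id) & 1u;
}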

On-chip cache coherence

An on-chip cache coherence protocol maintains coherence between the L3 cache partitions. The protocol adapts automatically to each L3 partition's access pattern. There are two classes of L3 cache miss requests: peer-to-peer requests, which the requester broadcasts to the other cache partitions, and mediated requests, which are sent to the appropriate gateway before any broadcast. A peer-to-peer request provides low-latency access to the processor's other L3 cache partitions. If the other partitions do not hold the cache line with the required access rights, or if there is a conflict with other peer-to-peer requests for the same cache line, the peer-to-peer request mutates automatically into a mediated request. A mediated request is handled by the gateway, which performs the required actions, such as broadcasting to the other L3 caches or sending a request to the system directory to obtain the cache line. Because the gateway serializes mediated requests per cache line, conflicts are not possible. In the event of an L3 cache miss in all partitions, a mediated request is more efficient because there is no on-chip broadcast.

Table 1. Workload caching characteristics.

Workload examples | Instructions | Data | Data source | Performance needs
Individual programs, virtual machines, pluggable databases | Not shared | Not shared | Instructions and data from memory | Instruction and data memory prefetch; predictable quality of service (QoS)
Multiple programs working on common data (for example, database shared global area) | Not shared | Shared | Instructions from memory, some data as cache-to-cache | Instruction memory prefetch; fast data cache-to-cache
Partitioned data processing (for example, analytics) | Shared | Not shared | Instructions cache-to-cache, data from memory | Fast instruction cache-to-cache; data memory prefetch; high-bandwidth, low-latency lower-level cache hits; predictable QoS
Linked lists, control data structure | Shared | Shared | Both instructions and data cache-to-cache | Fast instruction and data cache-to-cache

Table 1 shows that workloads running on a processor have various memory address sharing patterns. Furthermore, a given workload's code and data sharing patterns can vary over time. For shared addresses, peer-to-peer requests provide the lowest latency for cache-to-cache transfers between L3 cache partitions. Mediated requests better serve L3 cache misses for unshared addresses, which are not cached in other L3 partitions. Each L3 cache partition maintains a history of whether instruction and data L3 cache misses are serviced from another L3 partition. If data misses serviced from another partition exceed a threshold, the L3 cache partition services further data L3 cache misses using peer-to-peer requests; otherwise, it uses mediated requests. Instruction misses are handled similarly, with their own history and threshold. Thus, the M7 processor adapts quickly to diverse and varying workload behavior. When cache lines are victimized, another cache partition can take over as supplier, thus avoiding writing the cache line to memory.


This other cache partition may already hold a shared copy, or it may be designated as a victim cache. Thus, L3 cache partitions in idle SCCs can be used to extend the cache partitions in active SCCs.
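The following C sketch models this history-and-threshold selection for one partition's data misses. The article specifies only that a history is kept and compared against a threshold (separately for instruction and data misses); the window size, threshold value, and counter scheme below are assumptions.

#include <stdbool.h>

/* Per-partition history of whether recent L3 misses were serviced by a
 * peer partition (cache-to-cache) rather than memory. */
struct miss_history {
    unsigned cache_to_cache;  /* misses serviced by a peer partition */
    unsigned total;           /* all L3 misses observed this window  */
};

#define WINDOW    64u  /* assumed sampling window */
#define THRESHOLD 32u  /* assumed: half the window */

/* Returns true if the next miss should be issued as a peer-to-peer
 * broadcast, false if it should go to the gateway as a mediated request. */
static bool use_peer_to_peer(struct miss_history *h, bool last_was_c2c) {
    h->cache_to_cache += last_was_c2c;
    if (++h->total == WINDOW) {  /* end of window: decide and restart */
        bool p2p = h->cache_to_cache >= THRESHOLD;
        h->cache_to_cache = 0;
        h->total = 0;
        return p2p;
    }
    return h->cache_to_cache >= THRESHOLD;
}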

Memory

The M7 processor has four DDR4 MCUs, each connecting to two buffer-on-board (BoB) chips. Each BoB chip supports two DDR4-2133, -2400, or -2667 channels. The M7 processor thereby supports 16 DDR4 channels with up to 2 Tbytes of memory per processor. The measured bidirectional memory bandwidth is 160 GBps (DDR4-2133), roughly two and three times the memory bandwidth of the Sparc T5 and M6 processors, respectively.4,5 The MCUs support individual DIMM retirement without system stoppage, dynamically changing the memory interleaving from 16-way to 15-way. To reduce memory access latency, the MCUs buffer memory prefetches, triggered either by local peer-to-peer and mediated requests or by interprocessor speculative read requests. This design reduces memory access latency when the direct path to memory is shorter than the path to memory via the directory.
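A minimal sketch of N-way channel interleaving at cache-line granularity follows; retiring a DIMM corresponds to dropping the divisor from 16 to 15 and re-spreading lines across the surviving channels. The modulo mapping and 64-byte interleave unit are illustrative assumptions, not the M7's actual address hash.

#include <stdint.h>
#include <stdio.h>

/* Illustrative N-way channel interleave at an assumed 64-byte unit. */
static unsigned channel_of(uint64_t pa, unsigned ways /* 16 or 15 */) {
    return (unsigned)((pa >> 6) % ways);
}

int main(void) {
    uint64_t pa = 0x7f8040ull;
    printf("before DIMM retirement: channel %u\n", channel_of(pa, 16));
    printf("after DIMM retirement:  channel %u\n", channel_of(pa, 15));
    return 0;
}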


I/O

The M7 processor has four links with a total bandwidth of more than 75 GBps feeding one or two external ASICs with PCI Express 3.0 interfaces, more than doubling the I/O bandwidth of previous generations.

Power management

Maximizing performance per watt in the M7 design requires a multitude of high-speed power-saving techniques.6 These include optimal selection of static RAM (SRAM) cell sizes for speed and density, a new set of high-speed, lower-power flops, extensive clock gating, and gate sizing and transistor threshold optimization. The serializer-deserializer off-chip interface employs two different designs for different channel lengths, plus real-time adaptation circuits, optimizing link performance per watt.

Fine-grained power management achieves the best possible performance per watt under a wide range of environmental and workload conditions. A separate off-chip DC-to-DC converter is used per chip quadrant of two SCCs (or, optionally, per chip half of four SCCs). This design allows individual supply-voltage and digital voltage-frequency scaling per quadrant. Power gating each SCC supply achieves low idle power. Power management employs cycle skipping at the core level, digital frequency scaling at the SCC level, and digital voltage-frequency scaling at the quadrant level.

An on-chip power estimator collects the activity of carefully selected on-chip signals and calculates the power in real time for each core, L2 cache, and L3 cache to within 5 percent accuracy. This estimator responds very quickly (1 μs) to overcurrent situations, allowing operation close to the maximum current limit without risk of failure. The junction temperature of each SCC is monitored by 16 on-chip thermal sensors.6 A power management controller (PMC) aggregates power and thermal information and throttles individual SCCs.6 When too much current or power is being consumed, the PMC's first response is to cycle skip (250-ns response time), followed by slower voltage-frequency changes (500 μs).


This process optimizes the energy-delay tradeoff. The frequency control system uses a third-generation frequency-locked loop7 that reacts to sudden L(di/dt) voltage droops by temporarily reducing the clock frequency, avoiding timing failures while minimizing voltage margin. In addition, the PMC can limit total chip power to maximize performance in power-constrained systems.
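The two-level response can be pictured as a small control loop, sketched below in C. The cycle-skip-first, voltage-frequency-second ordering and the response times come from the text; the sampling interface and persistence count are assumptions.

/* Illustrative PMC response ladder: skip cycles immediately on an
 * overcurrent excursion (~250-ns response), then step voltage-frequency
 * down if the excursion persists (~500-us response). */
enum pmc_action { PMC_NONE, PMC_CYCLE_SKIP, PMC_LOWER_VF };

struct pmc_state {
    unsigned over_budget_samples;  /* consecutive samples above the limit */
};

#define PERSIST_SAMPLES 4u  /* assumed persistence before a V/F step */

static enum pmc_action pmc_step(struct pmc_state *s,
                                double estimated_watts,
                                double budget_watts) {
    if (estimated_watts <= budget_watts) {
        s->over_budget_samples = 0;
        return PMC_NONE;
    }
    /* Fast path: cycle skipping cuts current almost immediately. */
    if (++s->over_budget_samples < PERSIST_SAMPLES)
        return PMC_CYCLE_SKIP;
    /* Slow path: a sustained excursion warrants a voltage-frequency step. */
    return PMC_LOWER_VF;
}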

Software in silicon

The M7 processor incorporates hardware features for functions that have traditionally been performed less efficiently in software (https://swisdev.oracle.com). Application Data Integrity replaces very costly software instrumentation with low-overhead hardware monitoring, so that pointer-related software bugs, such as buffer overruns, can be detected and trapped in production software. Virtual address masking and fine-grained memory migration enable low-latency garbage collection. Database query acceleration not only accelerates database queries but also can decompress the input data.

Application data integrity

The M7 processor includes a new feature, application data integrity (ADI), which provides real-time data integrity checking to guard against pointer-related software bugs. The load-store units in each core check version metadata associated with memory data against a reference version in the virtual address. Four-bit version metadata is stored in memory and is maintained throughout the cache hierarchy and all interconnects. The upper unused bits of 64-bit virtual address pointers contain a 4-bit reference version. A "version mismatch" occurs when ADI is enabled and the 4-bit reference version contained in the virtual address of a load or store instruction does not match the memory metadata version at the addressed location in the memory system. Applications can inspect faulting references, diagnose the cause, and take appropriate recovery actions. M7 supports "development/debug" and "in-production" store-checking modes, which are configurable per hardware thread. Depending on the selected mode, version mismatches on stores result in precise or disrupting exceptions, respectively. The development/debug checking mode provides a complete stack trace to developers during the development cycle. The in-production mode records the program counter of the first highest-privilege-level store triggering the exception.
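The check can be modeled in C as follows. The 4-bit version and its placement in unused upper pointer bits follow the text; the exact bit position (bits 63:60 here) and the software-visible comparison are illustrative, since on M7 the load-store unit performs the check and a mismatch raises an exception.

#include <stdint.h>
#include <stdbool.h>

#define ADI_SHIFT 60  /* assumed position of the version nibble */
#define ADI_MASK  0xFull

/* Plant a 4-bit version in the upper bits of a 64-bit pointer. */
static uint64_t adi_set_version(uint64_t va, unsigned version) {
    return (va & ~(ADI_MASK << ADI_SHIFT)) |
           ((uint64_t)(version & ADI_MASK) << ADI_SHIFT);
}

static unsigned adi_pointer_version(uint64_t va) {
    return (unsigned)((va >> ADI_SHIFT) & ADI_MASK);
}

/* True if a load or store through 'va' would raise a version mismatch,
 * given the 4-bit version stored with the addressed memory. */
static bool adi_mismatch(uint64_t va, unsigned memory_version) {
    return adi_pointer_version(va) != (memory_version & ADI_MASK);
}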

Virtual address masking and fine-grained memory migration

In addition to ADI, a virtual address (VA) masking feature lets programs embed additional metadata in the upper unused bits of virtual address pointers. Applications using 64-bit pointers can set aside 8, 16, 24, or 32 bits, which the addressing hardware ignores when dereferencing pointers. This approach allows managed runtime engines, such as the Java Virtual Machine, to embed metadata for tracking object information. Runtime savings include the elimination of object state table lookups and a reduction in cache misses and memory loads. ADI and VA masking can coexist within one application and are particularly interesting for middleware applications that depend on in-memory fine-grained object migration, commonly addressed by various garbage collection algorithms. Most garbage collection algorithms move live objects with a so-called stop-the-world scheme, in which application threads are fully paused during object migration phases. With very large server memories, these algorithms do not scale, resulting in long and unpredictable pause times. In contrast, a dedicated memory version value can act as a very efficient fine-grained access control mechanism, which scales with threads and available memory bandwidth. A garbage collection thread can mark the live set with this dedicated value while allowing all application threads to continue running. If an application thread accesses the live set currently under migration, the application traps to code that resolves the conflict deterministically. This new algorithm fully bypasses operating system page protection overheads, thereby allowing huge pages.
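A sketch of how a managed runtime might use VA masking follows. The 16-bit tag width is one of the documented options (8, 16, 24, or 32 bits); the tag contents and helper names are hypothetical.

#include <stdint.h>

#define META_BITS  16  /* one of the documented widths: 8, 16, 24, or 32 */
#define META_SHIFT (64 - META_BITS)
#define ADDR_MASK  ((UINT64_C(1) << META_SHIFT) - 1)

/* Embed runtime metadata (for example, object state) in the top bits. */
static uint64_t tag_pointer(uint64_t va, uint16_t object_state) {
    return (va & ADDR_MASK) | ((uint64_t)object_state << META_SHIFT);
}

static uint16_t pointer_tag(uint64_t va) {
    return (uint16_t)(va >> META_SHIFT);
}

/* With VA masking enabled the hardware ignores the tag on dereference;
 * software on other platforms would have to strip it before every use. */
static void *strip_tag(uint64_t va) {
    return (void *)(uintptr_t)(va & ADDR_MASK);
}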

Database query acceleration

The M7 processor contains eight accelerators to which the initial portion of a database query plan can be offloaded. Custom hardware filters the input data down to a subset of interest, using value and range comparison operators, set membership lookups, and filtering functions. The accelerators can also decompress the input data, which may be kept in compressed form to conserve memory footprint and bandwidth. The processor cores can then operate on the resulting uncompressed, filtered data. Each accelerator contains four pipes, which can be configured to exploit data parallelism on the same task, or task parallelism on the same dataset. User programs access the accelerators through a special query API call. The API submits commands to the accelerator via a hyperprivileged interface. Data is read directly from, and written directly to, the user's memory space. When a command completes, a status message is written to a designated completion area. User programs can monitor this status using the Sparc mwait instruction, which automatically sleeps a thread until activity is seen at the target location. No processor cycles are wasted polling the location. Offloading a decompression operation with no filtering and using only one accelerator pipe (no task or data parallelism is exploited) yields an average of eight times the performance of single-threaded execution on a processor core (see Figure 3). The compression function used is a proprietary n-gram compressor. Each microbenchmark in Figure 3 is an API function call on a 1-million-element input dataset taken from the Canterbury Corpus (http://corpus.canterbury.ac.nz).

Figure 3. Accelerator decompression performance relative to a single-threaded application. The benchmarks comprise the complete genome of E. coli, the first million digits of pi, and the CIA world fact book.
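The submission-and-wait pattern might look like the following C sketch. The hyperprivileged submission interface and mwait-based completion monitoring come from the text, but dax_submit(), the DONE value, and the completion-word layout are invented names for illustration; the real query API is not described at this level here.

#include <stdint.h>

#define DONE 1u  /* hypothetical completion status */

/* Hypothetical entry point wrapping the hyperprivileged interface. */
int dax_submit(const void *cmd, volatile uint64_t *completion);

static void wait_for_completion(volatile uint64_t *completion) {
    /* Portable stand-in: spin on the completion word.  On M7 the thread
     * would instead execute mwait on this address, sleeping until the
     * accelerator writes the status message, so no cycles are burned
     * polling. */
    while (*completion != DONE)
        ;
}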


Figure 4. M7 processor performance relative to the M6 processor. The first bar represents M6 baseline performance across all workloads; each subsequent bar represents the M7-over-M6 performance multiplier for the indicated workload (memory bandwidth, integer throughput, OLTP, Java, ERP, and floating-point throughput).


Processor performance

The M7 processor not only improves total throughput but also increases per-thread performance, even at the maximum thread count. The improvement spans a broad range of workloads, including online transaction processing (OLTP), Java, and enterprise resource planning (ERP), as well as integer and floating-point applications. Architectural and design choices enable M7 to deliver 2.9 to 3.5 times the performance of M6 (see Figure 4). The shared L2 cache lets existing M6 and T5 application threads spread over two M7 S4 cores, further increasing per-thread performance by exploiting twice the core execution resources. M7 has 50 percent greater L2 cache capacity per core than the T5 or M6 processors, with the same cycle-count latency. L2 instruction and data caches are separated to prevent big datasets from evicting instruction cache lines in OLTP and data analytics workloads. A partitioned L3 cache reduces L3 cache-hit latency by 25 percent. Each L3 cache partition, with its dynamic tracking of workload-caching behavior, applies the best choice of protocol to achieve the shortest latency on L3 cache misses.


Furthermore, L3 cache partitions can dynamically join or disjoin based on the processor's number of active threads, providing the best per-thread distribution of the last-level cache. The partitioned L3 cache and low-latency OCN are a key design choice in M7, with the resulting performance improvement visible on every workload. The cache and memory hierarchy provides 1.6 Tbytes per second (TBps) of L3 cache bandwidth and 160 GBps of memory bandwidth to 32 cores, improving L3 cache bandwidth five times and memory bandwidth three times over the predecessor and delivering excellent performance on analytics workloads.

Scalability

Figure 5a shows how M7's direct processor-to-processor coherence links enable shared-memory multiprocessors (SMPs) of up to eight processors, providing up to 256 cores, 2,048 threads, and 16 Tbytes of main memory. Smaller systems use link trunking to achieve an optimal bandwidth-to-connectivity balance. For example, the two nodes of a dual-processor system are interconnected using four coherence links. Similarly, each node in a four-processor system connects to each other node with two coherence links. Coherence link lanes operate at up to 18 Gbps, achieving a bisection payload bandwidth of more than 1 TBps in an eight-processor system.

Figure 5. The M7 processor supports "glueless" SMPs and larger "glued" SMPs. (a) Eight-processor "glueless" shared-memory multiprocessor (SMP) with more than 1 TBps of bisection bandwidth; (b) 32-processor "glued" SMP with more than 5 TBps. Processors in a glueless SMP communicate through direct processor-to-processor links, whereas processors in a glued SMP primarily communicate indirectly through switch and directory ASICs.

Cache coherence between processors is maintained using a directory, which is fine-grained interleaved across the processors to avoid hotspots. Additionally, processors implement a link-level dynamic congestion avoidance mechanism, which enables alternate-path routing of data packets based on the queue utilization of the intermediate nodes. Link RAS features include frame retry, link retrain, and single-lane failover, all performed automatically in hardware. The 32-processor SMP shown in Figure 5b provides up to 1,024 cores, 8,192 threads, and 64 Tbytes of main memory. ASICs perform directory and switching functions to enable scaling beyond eight processors. In the 32-processor system, two switch groups of six ASICs provide a total of more than 5 TBps bisection system payload bandwidth, six times the bandwidth of the M6 system generation. The coherence directory is fine-grained interleaved across all the ASICs. Groups of four processors are locally fully interconnected, forming a minimal physical domain. The ASICs enforce the boundaries between physical domains and allow dynamic combining of multiple four-processor groups into larger physical domains. Each M7 processor is directly connected via two scalability links to each of the 12 64-port ASICs.

This topology provides lower latency than previous system generations.8 To avoid congesting local coherence links, data traffic within a processor group can overflow to the scalability links. In addition to the link RAS features and the hardware domaining support described earlier, the M7 system can operate with only five of the six ASICs per group.
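The alternate-path selection can be sketched as choosing the least-utilized candidate next hop, as in the C fragment below. The queue model and selection rule are assumptions; the article states only that routing decisions consider queue utilization at intermediate nodes.

/* Illustrative congestion-aware route choice: among the candidate links
 * toward the destination, forward on the one with the shallowest queue. */
struct link {
    unsigned queue_depth;  /* entries waiting on this link */
};

static unsigned pick_route(const struct link *candidates, unsigned n) {
    unsigned best = 0;
    for (unsigned i = 1; i < n; i++)
        if (candidates[i].queue_depth < candidates[best].queue_depth)
            best = i;
    return best;
}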

Coherent memory clusters

The M7 processor supports coherent memory between SMPs within a cluster. SMPs can export shared memory regions to other SMPs in the cluster. The interconnect topology is similar to that shown in Figure 5b, but with up to 64 processors while using fewer switches. The cluster switches are arranged in two groups of three, but the system remains functional with as few as one switch per group. Although the same switch ASICs are used as for the large SMP of Figure 5b, the directory function of each ASIC is not used. This coherent memory cluster feature provides a familiar shared-memory programming model in a clustered environment that is robust against communication and SMP failures.


Importing SMPs can optionally allocate a region of local memory to act as a fully sized cache for the imported region. Each cache line in the locally allocated memory is initially marked as invalid using a reserved ADI version value. A load to such a cache line traps to software, which loads the cache line from the home SMP. Other ADI version values are available to support ADI over the cluster. Stores to an imported region are recorded in a software store buffer and update any valid local copy. Software propagates such stores from the store buffer to the home SMP. The storing thread executing a “Store Barrier” forces the propagation of older stores. Propagating a store to the home SMP invalidates any copies of the cache line held by other importing SMPs and makes the store persistent against failure of the storing SMP. To preserve atomicity in the presence of communication and SMP failures, atomic operations are performed at the exporting SMP by a message processing unit (MPU). This approach is in contrast to loads and stores, which are performed locally in the importing SMP. Apart from asynchronous store propagation, all software is executed by the thread performing the memory operation, using MPUs to perform operations on other SMPs. Thus, there are no delays scheduling software to run remotely. Access to remote memory must be authorized by a valid remote authorization key issued by the memory owner. The owner’s MPU validates each incoming request; if the key is invalid, the MPU sends an error notification to the requester and drops the request.
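The importing-side store path described above can be sketched in C as follows. The record-then-update behavior and the barrier-driven drain come from the text; the buffer layout, its size, and the propagate/update hooks are illustrative.

#include <stdint.h>

/* Stores to an imported region: record in a software store buffer, update
 * any valid local copy, and drain to the home SMP on a store barrier. */
struct buffered_store { uint64_t addr; uint64_t data; };

#define SB_ENTRIES 64  /* assumed buffer size */
static struct buffered_store sb[SB_ENTRIES];
static unsigned sb_count;

/* Hypothetical platform hooks. */
extern void propagate_to_home_smp(const struct buffered_store *st);
extern int  local_copy_valid(uint64_t addr);
extern void update_local_copy(uint64_t addr, uint64_t data);

/* Store barrier: propagating a store to the home SMP invalidates copies
 * on other importing SMPs and makes it persistent against failure of the
 * storing SMP. */
static void store_barrier(void) {
    for (unsigned i = 0; i < sb_count; i++)
        propagate_to_home_smp(&sb[i]);
    sb_count = 0;
}

static void imported_store(uint64_t addr, uint64_t data) {
    if (sb_count == SB_ENTRIES)
        store_barrier();                /* drain when the buffer fills */
    sb[sb_count++] = (struct buffered_store){ addr, data };
    if (local_copy_valid(addr))
        update_local_copy(addr, data);  /* keep the local copy current */
}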

The dramatic performance increase and RAS enhancements, together with the software-in-silicon and coherent-clustering features of the Oracle M7 processor, make it ideal for a wide range of enterprise servers.

References
1. S. Phillips, "M7: Next Generation SPARC," Hot Chips 26, 2014.
2. P. Li et al., "A 20nm 32-Core 64MB L3 Cache SPARC M7 Processor," to be published in Proc. IEEE Int'l Solid-State Circuits Conf., 2015.
3. R. Sivaramakrishnan and S. Jairath, "Next Generation SPARC Processor Cache Hierarchy," Hot Chips 26, 2014.
4. J. Feehrer et al., "The Oracle Sparc T5 16-Core Processor Scales to Eight Sockets," IEEE Micro, vol. 33, no. 2, 2013, pp. 48–57.
5. A. Vahidsafa and S. Bhutani, "SPARC M6: Oracle's Next Generation Processor for Massively Scalable Symmetric Multiprocessor (SMP) Data Center Servers with Enterprise Class RAS," Hot Chips 25, 2013.
6. V. Krishnaswamy et al., "Fine-Grained Adaptive Power Management of the SPARC M7 Processor," to be published in Proc. IEEE Int'l Solid-State Circuits Conf., 2015.
7. Y. YangGong et al., "Asymmetric Frequency Locked Loop (AFLL) for Adaptive Clock Generation in a 28nm SPARC M6 Processor," Proc. IEEE Asian Solid-State Circuits Conf., 2014, pp. 373–376.
8. T. Wicki and J. Schulz, "Bixby: The Scalability and Coherence Directory ASIC in Oracle's Highly Scalable Enterprise Systems," Hot Chips 25, 2013.

Kathirgamar Aingaran is a director of hardware development at Oracle. His interests are in joint hardware-software development. Aingaran has an MS in electrical engineering from Stanford University. He is a senior member of IEEE. Contact him at [email protected].

Sumti Jairath is a senior hardware architect at Oracle. His research interests include hardware-software interaction, software-specific hardware design, and coherence scaling. Jairath has a BTech in electronics and communication engineering from the National Institute of Technology, Kurukshetra, India. Contact him at [email protected].

Georgios Konstadinidis is a senior hardware architect at Oracle, where he is involved in process technology, physical design methodology, and power management. Konstadinidis has a PhD in microelectronics from the Technical University of Berlin. Contact him at [email protected].

Serena Leung is a senior principal hardware engineer at Oracle. Her interests are multiprocessor cache and coherency design. Leung has an MS in electrical engineering and computer science from the University of California, Berkeley. Contact her at [email protected].

Paul Loewenstein is a senior principal hardware engineer at Oracle. His research interests include multiprocessor cache coherence protocols, formal memory models, deadlock and livelock avoidance, and error detection and correction. Loewenstein has a PhD in computer science from the University of Cambridge. Contact him at paul.loewenstein@oracle.com.

Curtis McAllister is a senior director of hardware engineering at Oracle. His research interests include multiprocessor coherence, parallel processing, and timing optimization. McAllister has a BS in computer engineering from Iowa State University. Contact him at [email protected].

Stephen Phillips is a senior director of the Sparc Architecture Group at Oracle. His research interests include high-performance microprocessors, cache coherence, and hardware acceleration. Phillips has an MS in electrical engineering from Rutgers University. Contact him at [email protected].

Zoran Radovic is a senior principal engineer in the microelectronics division of Oracle. His research interests include architectural support for programming languages and operating systems, distributed computing, and software reliability. Radovic has a PhD in computer science from Uppsala University. Contact him at [email protected].

Ram Sivaramakrishnan is a director of hardware at Oracle. His research interests include low-power cache design for high-throughput multithreaded architectures and low-overhead multiprocessor coherency protocols. Sivaramakrishnan has an MS in electrical engineering from the University of Nevada, Las Vegas. Contact him at [email protected].

David Smentek is a senior hardware architect at Oracle. His research interests include computer architecture, computer arithmetic, performance modeling, and on-chip networks for multicore architectures. Smentek has an MS in electrical engineering from Stanford University. Contact him at [email protected].

Thomas Wicki is a senior principal hardware engineer in Oracle's Microelectronics Architecture group. His research interests include scalable system architectures and interconnect designs. Wicki has a PhD in electrical engineering from the Swiss Federal Institute of Technology, Zurich. Contact him at [email protected].
