DSPIN vs ANOC

Physical Implementation of the DSPIN Network-on-Chip in the FAUST Architecture

Ivan Miro-Panades1,2, Fabien Clermidy3, Pascal Vivet3, Alain Greiner1
1 The University of Pierre et Marie Curie, 75252 Paris, France
2 STMicroelectronics, 38921 Crolles, France
3 CEA-Leti, MINATEC, 38054 Grenoble, France
{ivan.miro, alain.greiner}@lip6.fr, {fabien.clermidy, pascal.vivel}@cea.fr

Abstract

This paper presents a physical implementation of the DSPIN network-on-chip in the FAUST architecture. FAUST is a stream-oriented multi-application SoC platform for telecommunications addressing the IEEE 802.11a and MC-CDMA standards. The original asynchronous network-on-chip (ANOC) of FAUST has been replaced by the multi-synchronous DSPIN network-on-chip. In this paper, we analyze how the DSPIN network-on-chip, originally designed to support shared memory multi-processor architectures, can support stream-oriented architectures. The physical implementations of both ANOC and DSPIN are presented. Finally, a comparison between the ANOC and DSPIN designs in a 130nm technology is carried out in terms of area, throughput, packet latency, and power consumption.

1. Introduction

Increasing system performance by scaling the technology and the clock frequency is becoming more and more complex because wire delays scale poorly. New approaches such as Network-on-Chip (NoC) architectures and the Globally Asynchronous, Locally Synchronous (GALS) paradigm try to solve this design bottleneck by partitioning the circuit into small synchronous islands that communicate asynchronously. Each island can be clocked at an independent frequency, while the communications between neighboring islands are carried out by the NoC. Moreover, the NoC approach attempts to solve the bandwidth bottleneck of a central bus by splitting the communications over a plurality of routers and links.

A large number of NoC architectures have been published, but there are very few detailed analyses of their physical implementation: SPIN [16,10], Tera-scale [11], Silistix [14], FAUST [8], and Xpipes [17]. A 32-port SPIN NoC has been implemented in [10,13]. However, the architecture is not design-flexible and does not support the GALS approach because it was designed as a synchronous, centralized hard macro. The Tera-scale [11] architecture contains an 80-tile processor architecture interconnected by a NoC. A mesochronous technique was used to distribute a 4GHz clock signal over the 275mm² die. However, its network router takes 0.34mm² in CMOS 65nm, which is several times larger than the DSPIN router presented in this paper (when compared at the same fabrication process) while offering the same type of service. The Silistix CHAIN network [14] is based on packet switching using asynchronous QDI 4-rail links and is composed of basic elements such as muxes, demuxes, and arbiters. The CHAIN architecture allows the GALS strategy but offers neither a real Network-on-Chip protocol nor Quality-of-Service features.

The DSPIN [2] network-on-chip is an evolution of the SPIN [16] architecture, and has been developed by the LIP6 laboratory in cooperation with STMicroelectronics. DSPIN is a distributed 2D mesh NoC well suited to the GALS approach. Its architecture is synthesizable with a standard synchronous cell library, using neither custom nor asynchronous cells. The DSPIN architecture was initially designed to support shared memory multi-processor architectures.

In this work, we present the physical implementation of the DSPIN network in a stream-oriented, multi-application platform: the FAUST generic architecture, developed by CEA/LETI [8]. The main goal is to evaluate whether a network architecture optimized for shared memory can efficiently support stream-oriented applications. To carry out this evaluation, we replaced the asynchronous network-on-chip (called ANOC) of the FAUST chip by the DSPIN network-on-chip.

The FAUST and ANOC architectures are first analyzed in section 2. The DSPIN architecture is detailed in section 3. Section 4 compares the two NoC architectures and describes the migration from ANOC to DSPIN. In section 5, the step-by-step implementation of FAUST with the DSPIN network is presented. Finally, in section 6 the ANOC and DSPIN designs are compared in terms of silicon area, throughput, latency, and power consumption.

2. FAUST Application

FAUST, which stands for Flexible Architecture of Unified Systems for Telecom, is a hardware demonstration platform for the 4MORE mobile terminals. 4MORE [1] is an IST program targeting 4G baseband modem chips. The FAUST project was initiated in 2003 to support multiple OFDM air interfaces in a single SoC. The FAUST architecture (Figure 1) is composed of processing units interconnected by a NoC. It also includes an ARM946ES in an AHB subsystem. The functional units communicate by message passing through the NoC. Each processing unit contains a programmable Network Interface Controller, which contains input and output FIFOs and regulates the traffic through the network. This regulation relies on credits that synchronize the producer with the consumer, in the manner of a self-synchronized data pipeline.

[Figure 1 image: FAUST block diagram showing the TX units, RX units, AHB units, asynchronous nodes with their async/sync interfaces, and the pad interfaces (NOC1 IF: 84 pads, NOC2 IF: 83 pads, RAM IF: 58 pads, Ethernet IF: 17 pads).]

Figure 1. FAUST architecture

The FAUST chip is a multi-application platform for 4G telecom. It can support OFDM-based applications such as the 802.11a standard, MC-CDMA [1,9] and 3GPP-LTE protocols. All these applications share the same set of constraints, including real-time requirements, high throughput and low power consumption for battery-powered devices. In this paper, a SISO MC-CDMA data-streaming application called Matrice [9] is addressed. It consists of transmitting and receiving frames using OFDM and CDMA techniques, with data rates up to 100 Mbit/s. In this paper, we focus on the Matrice receiver (RX), which requires 10 IP blocks from the complete FAUST platform. For this application, the NoC interconnect must support an aggregated throughput of up to 10.6 Gbit/s to meet the real-time constraints imposed by the OFDM frame rate. An OFDM frame must be processed in less than 650µs. A detailed description of the frame composition and decoding method can be found in [9].

2.1. ANOC Architecture

ANOC stands for Asynchronous NoC and has been developed by CEA-Leti [6]. ANOC is a wormhole packet-switching NoC with a 32-bit payload. Its architecture is fully asynchronous and has been implemented using ST standard cells and the TIMA TAL library [15]. It is composed of five-port routers interconnected by bidirectional links using a send/accept asynchronous handshake protocol (Figure 2). Thus, the ANOC protocol naturally offers a GALS architecture. As the ANOC routers are asynchronous, the entire end-to-end path traveled by the packets is completely asynchronous. Only the input and output ports of the NoC are resynchronized to the local synchronous IP frequency, using dedicated synchronization FIFOs [7]. Moreover, a four-phase protocol is used on the network, which guarantees that no metastability issues arise inside it. Only the input and output ports of the network, where synchronization is required, are susceptible to metastability failure.

Figure 2. ANOC architecture

The ANOC topology is not restricted to a regular 2D mesh. Irregular 2D mesh or torus topologies can also be implemented, as ANOC uses a source routing algorithm.

Moreover, source routing can be used to minimize the congestion on some links, and thus reduce the packet latency. The first flit of the packet contains the routing information, and each router uses this “path-to-target” to decide the correct routing direction. A flit is the smallest flow control unit of the network. The routing information is encoded on 18 bits, with two bits per routing hop, as shown in Figure 8a. Hence, the routing path is limited to nine hops, H0 to H8. However, a path extension mechanism is also proposed to extend the routing path [6]. ANOC provides two virtual channels per NoC link: a low-latency channel VC0 for real-time applications, and a higher-latency, lower-priority channel VC1 for best-effort traffic.
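The 18-bit path-to-target field can be illustrated with a minimal sketch. The text only states that each hop takes two bits and that at most nine hops (H0 to H8) fit in the field; the placement of H0 in the least significant bits, the shift-right consumption by each router, and the function names are assumptions made for this illustration, not the ANOC specification.

```python
HOP_BITS = 2   # two bits per routing hop (from the text)
MAX_HOPS = 9   # 18 routing bits / 2 bits per hop, hops H0 to H8


def encode_path(hops):
    """Pack a list of 2-bit hop codes into an 18-bit path-to-target field.

    Assumption: H0 occupies the least significant bits.
    """
    if len(hops) > MAX_HOPS:
        raise ValueError("more than 9 hops requires the path extension mechanism")
    field = 0
    for i, hop in enumerate(hops):
        if not 0 <= hop < (1 << HOP_BITS):
            raise ValueError("each hop code must fit in two bits")
        field |= hop << (HOP_BITS * i)
    return field


def next_hop(field):
    """Return (hop code for the current router, field forwarded downstream).

    Assumption: each router consumes the lowest two bits and shifts the rest.
    """
    return field & ((1 << HOP_BITS) - 1), field >> HOP_BITS


if __name__ == "__main__":
    field = encode_path([1, 2, 3])
    hop, rest = next_hop(field)
    print(f"field=0x{field:05x} first hop={hop} remaining=0x{rest:05x}")
```

A path longer than nine hops is rejected here, which is where the path extension mechanism of [6] would come into play.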

2.2. ANOC Implementation

The ANOC design has been implemented in the STMicroelectronics 130nm technology, using standard place-and-route tools (Encounter from Cadence). In the proposed architecture, two challenges need to be addressed: the physical implementation of the ANOC router, which is robust QDI 4-phase/4-rail asynchronous logic [6], and the implementation of the GALS interfaces [7].

For the ANOC router, a hard-macro approach was chosen in order to re-use the ANOC router all over the FAUST top floor-plan. This choice allows proper placement of the ANOC router port signal pins (North, East, South, West, Unit). The ANOC router contains robust QDI logic, which is implemented using standard cells and specific C-elements from the TAL library [15], jointly developed by the TIMA laboratory and CEA-Leti. Because the logic is un-clocked, a pseudo-clock has been emulated within the tool, using the reset signal of the ANOC router logic, in order to optimize place and route under timing constraints (Figure 3). Using this pseudo-clock mechanism, asynchronous logic timing loops are broken, and static timing analysis of the QDI logic can be performed using standard tools. Due to the 4-phase protocol, the pseudo-clock frequency must be equal to 4 times the targeted average asynchronous frequency.

[Figure 3 image: pipeline stages n-1, n and n+1 of combinational QDI logic, with the global reset signal (rst_n) used as the pseudo-clock.]

Figure 3. QDI logic optimization with pseudo-clock

Once the ANOC router hard macro was available, the standard abstract and GDS files were generated. A pseudo-synchronous timing model of the asynchronous ANOC router was also automatically generated using this pseudo-clock.

For the implementation of the GALS interfaces, a soft-macro approach was chosen. This approach allowed the clock tree of each interface to be balanced together with its attached synchronous unit, and also imposed fewer constraints on top-level floor-planning.

For the top level, the complete floor-planning was done in order to place all the hard macros: ANOC routers, SRAM memories, and the ARM946 core (Figure 4). The place and route was done hierarchically with five distinct partitions using the Encounter tool. Thanks to the ANOC router hard macro, neither top-level “spaghetti” routing nor the usual congestion drawbacks were observed; only parallel routing of ANOC link signals between ANOC routers was performed. The timing analysis and optimization of the NoC links was possible using the pseudo-synchronous timing model of the ANOC router. For the GALS interfaces, the timing optimization was nevertheless more difficult due to the mixed-timing constraints of these interfaces [7].

Figure 4. FAUST floor-plan with ANOC

Finally, due to the GALS nature of the chip, the clock distribution consists of 27 independent clock trees, one per synchronous IP unit. The tool generated these 27 clock trees one by one, with neither timing convergence problems nor floor-planning issues (clock congestion, and so on).

3. DSPIN NoC

DSPIN [2] stands for Distributed, Scalable, Programmable, Integrated Network. It is a wormhole packet-based NoC with a 2D mesh topology. The packets are routed following the deterministic X-first routing algorithm: packets are first routed in the X direction and then in the Y direction. The routing information is the absolute address of the destination subsystem, encoded in the first flit of the packet. Figure 8b shows the first flit and the following flits of the packet. DSPIN uses a generic flit size, which has been set to 34 bits in this implementation, providing a payload of 32 bits.

[Figure 5 image: Cluster(Y,X) surrounded by its neighboring clusters, with the North, South, East, West and Local router modules and their GS and BE channels.]

Figure 5. DSPIN architecture

In order to address the GALS issues, the DSPIN router itself is distributed and is composed of five modules. Four of them are placed on the north, south, east, and west sides of the subsystem. The fifth, a local module, communicates with the local subsystem through the Network Interface Controller (NIC). Figure 5 presents the DSPIN architecture and its modules. The local subsystem and the associated DSPIN router compose a synchronous cluster. With this approach, the longest wires are the intra-cluster wires (for example, from an input module on the west side to an output module on the east side), and they cannot be longer than the cluster size.

DSPIN is a multi-synchronous architecture synthesizable with standard cells. Moreover, it is compatible with the Globally Asynchronous, Locally Synchronous approach, where synchronous islands or subsystems communicate asynchronously. Each DSPIN router is clocked at the network clock frequency, but a phase skew can exist between two neighboring routers. Moreover, each subsystem can have its own clock frequency, which can be independent of the network clock frequency. Hence, the inter-router communication is mesochronous, while the communication between a router and its subsystem is asynchronous. These communications are carried out by bi-synchronous FIFOs [3,4,5].

Figure 6. DSPIN topology and clocking regions
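The X-first routing decision described in this section can be sketched as follows. This is a minimal illustration, not the DSPIN RTL: the mapping of North to increasing Y and East to increasing X, as well as the function names, are assumptions.

```python
def x_first_route(cur, dst):
    """Output port chosen by the router at `cur` for a packet addressed to `dst`.

    Both arguments are (y, x) cluster coordinates. The X coordinate is
    resolved first, then the Y coordinate (deterministic X-first routing).
    Assumption: East means increasing X and North means increasing Y.
    """
    cy, cx = cur
    dy, dx = dst
    if dx > cx:
        return "East"
    if dx < cx:
        return "West"
    if dy > cy:
        return "North"
    if dy < cy:
        return "South"
    return "Local"


def route(src, dst):
    """List the output ports taken hop by hop from `src` to `dst`."""
    step = {"East": (0, 1), "West": (0, -1), "North": (1, 0), "South": (-1, 0)}
    cur, ports = src, []
    while True:
        port = x_first_route(cur, dst)
        ports.append(port)
        if port == "Local":
            return ports
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])


if __name__ == "__main__":
    print(route((0, 0), (2, 1)))
```

Because the decision depends only on the current and destination coordinates, every packet between a given pair of clusters follows the same path, which is what makes the algorithm deterministic.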

In order to avoid metastability on the mesochronous links, neighboring routers have inverted clock phases. Thus, the bi-synchronous FIFO is able to interface the mesochronous links even when the skew between neighboring routers is up to 50% of the clock period [3]. Figure 6 shows a 4x5 network architecture. Each square defines a cluster, which contains a subsystem and a DSPIN router. The DSPIN routers on the black squares have their clock signal inverted, while the ones on the white squares do not. Consequently, neighboring routers always have inverted clock signals.
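The checkerboard assignment of inverted clocks can be checked with a short sketch. Which parity corresponds to the black squares of Figure 6, and the reading of 4x5 as 4 rows by 5 columns, are assumptions; the property verified is only that any two mesh neighbors end up with opposite clock phases.

```python
def clock_inverted(y, x):
    """True if the router of cluster (y, x) receives an inverted clock.

    Assumption: clusters with odd (y + x) play the role of the black squares.
    """
    return (y + x) % 2 == 1


def neighbors_always_opposite(rows, cols):
    """Check that every pair of adjacent clusters has opposite clock phases."""
    for y in range(rows):
        for x in range(cols):
            for ny, nx in ((y + 1, x), (y, x + 1)):
                if ny < rows and nx < cols:
                    if clock_inverted(y, x) == clock_inverted(ny, nx):
                        return False
    return True


if __name__ == "__main__":
    for y in range(4):
        print(" ".join("I" if clock_inverted(y, x) else "." for x in range(5)))
    print(neighbors_always_opposite(4, 5))
```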

4. Migration of DSPIN into FAUST

In this section, the ANOC and DSPIN NoCs are first compared, and then ANOC is replaced by DSPIN within the FAUST architecture.

4.1. Architecture Comparison

The DSPIN architecture, as explained in [2], was designed for generic shared memory multiprocessor architectures. In order to avoid deadlocks in request/response traffic, DSPIN contains two fully separated sub-networks for request and response packets, as shown in Figure 7a. Moreover, the DSPIN NIC is very simple because the routing address (Y, X) can be directly extracted from the MSB bits of the destination address. Thus, the NIC does not require any configuration.

Figure 7. Programming model

Moreover, the DSPIN routing technology can be used with a message passing programming model such as the one in the FAUST platform. The shared-memory network interface controller (NIC) must be replaced by a stream-oriented NIC (Figure 7b). This stream-oriented NIC has to manage end-to-end flow control signals to avoid deadlock situations, maintain the end-to-end FIFO synchronicity, and minimize the network congestion. For those stream-oriented applications, the DSPIN architecture requires just one DSPIN router per cluster, while the shared-memory approach requires two.

Figure 8. ANOC and DSPIN packet definition

DSPIN and ANOC use similar packet formats, as shown in Figure 8. DSPIN has a generic flit size that can be adjusted to fit the ANOC flit size of 34 bits. Thus, both architectures have a 32-bit payload. ANOC uses 18 bits of the first flit for the source routing information, while DSPIN uses 8 bits for the destination address. Furthermore, both architectures use the same flow control bits, Begin_of_Packet (BOP) and End_of_Packet (EOP). Table 1 summarizes the similarities and differences between ANOC and DSPIN.

Table 1. ANOC and DSPIN architecture comparison

| | ANOC | DSPIN |
|---|---|---|
| Topology | Irregular | Regular 2D mesh |
| Router arity | 5 port router | 5 port router |
| Routing technique | Source routing | Address-based X-First algorithm |
| Switching technique | Wormhole | Wormhole |
| Flit size | 34 bits | 34 bits (generic) |
| Flit payload | 32 bits | 32 bits (generic) |
| Flow control bits on the flit | Begin of packet (BOP), End of packet (EOP) | Begin of packet (BOP), End of packet (EOP) |
| Routing overhead and capability | 18 bits, allowing 9 routing hops. Path extension is possible | 8 bits, allowing any architecture up to 16x16 clusters |
| Virtual channels | Best effort and Guaranteed service | Best effort and Guaranteed service |
| Programming model | Message passing | Shared memory (2 routers per cluster), Message passing (1 router per cluster) |
| Clocking scheme | Fully asynchronous (QDI) with GALS interfaces | Multi-synchronous with mesochronous interfaces |
| Flow control protocol | Send/accept asynchronous handshake | FIFO protocol (Write and WriteOk) |
| Metastability issues | Metastable-free inside routers (4 phase protocol). GALS FIFO interfaces on the local ports | Resolved by bi-synchronous FIFOs |
| Clock tree | None | One per router |
| Physical implementation | Hard macro | Soft macro distributed on five modules |
| Long wires | Inter-router wires | Intra-cluster wires |
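To illustrate why the shared-memory DSPIN NIC needs no configuration, the following sketch extracts the (Y, X) routing address from the most significant bits of a destination address. The 8-bit field and the 16x16 limit come from the text; the 32-bit address width, the placement of Y in the upper four bits, and the function name are assumptions made for the example.

```python
ADDR_BITS = 32   # assumed address width
COORD_BITS = 4   # 8 routing bits split into 4 bits for Y and 4 bits for X


def destination_cluster(address):
    """Return (y, x) taken from the 8 most significant bits of `address`.

    Assumption: Y is stored in the upper four bits and X in the lower four.
    """
    if not 0 <= address < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 32 bits")
    top = address >> (ADDR_BITS - 2 * COORD_BITS)
    return top >> COORD_BITS, top & ((1 << COORD_BITS) - 1)


if __name__ == "__main__":
    y, x = destination_cluster(0x3A000000)
    print(f"cluster (Y={y}, X={x})")
```

With four bits per coordinate, the largest addressable mesh is 16x16 clusters, which matches the routing capability listed for DSPIN.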

4.2. Integration of DSPIN in the FAUST Architecture

Figure 9a shows the integration of the ANOC router into the FAUST architecture. The ANOC router is asynchronous while the NIC is synchronous. Therefore, the GALS interface module contains 4 FIFOs that perform the asynchronous communication between these two modules while buffering the data. The DSPIN router, on the other hand, uses a different flow control protocol and routing technique than ANOC, and already integrates two of the four GALS interface FIFOs. Therefore, a protocol_conversion module was designed to interface the FAUST NIC with the DSPIN router (Figure 9b). This module adapts the flow control signals, converts the routing algorithm, and integrates two bi-synchronous FIFOs. The routing algorithm conversion was implemented with a Look-Up Table (LUT) in which the source-routing path of ANOC is recoded into the (Y,X) destination of DSPIN. This solution is not optimized, as it uses a hard-wired LUT. However, this work focused on a fair comparison of two NoCs on the same architecture and not on the optimization of the architecture for each NoC. Otherwise, it would have been necessary to modify the NIC of FAUST.

[Figure 9 image: a) ANOC IP template: IP and NIC (CLK_IP), synchronous SEND/ACCEPT, GALS interface, asynchronous SEND/ACCEPT, ANOC router. b) DSPIN IP template: IP and NIC (CLK_IP), synchronous SEND/ACCEPT, protocol_conversion module with LUT, asynchronous READ/WRITE, DSPIN router (CLK_NoC), mesochronous READ/WRITE.]

Figure 9. IP integration detail

The FAUST architecture does not follow a regular 2D mesh topology, which is imperative for the DSPIN NoC. Therefore, some FAUST connections were rearranged in order to respect a regular mesh topology. The FAUST mapping topology was designed for a source routing algorithm. When the deterministic DSPIN X-First routing algorithm was used instead, some GS routing conflicts appeared, because source routing can avoid some congested links, which is not possible with a deterministic routing algorithm. In order to respect the real-time constraints without modifying the mapping topology, the DSPIN BE and GS FIFOs were dimensioned to a depth of 7 words to absorb these routing conflicts. Finally, the system performance for the reference OFDM application described in section 2 was equivalent for the ANOC and DSPIN networks. A more detailed and systematic comparison using synthetic traffic is described in section 6.

5. DSPIN Implementation

The DSPIN architecture has been designed to be synthesizable with standard cells and easily implemented in a synchronous digital flow. Therefore, neither custom nor asynchronous cells have been used. The clock boundaries and the long-wire issues have been analyzed to minimize the timing closure effort [2].

5.1. Synthesis

We used a hierarchical approach for the physical synthesis of the FAUST architecture with the DSPIN NoC. Each cluster was synthesized separately, before being assembled in the top-level FAUST architecture. Thus, no RTL synthesis was performed at the top level. The design was synthesized using STMicroelectronics CMOS 130nm low power standard cells. The timing constraints for the synthesis of the DSPIN routers were chosen to take the physical implementation into account. Thus, the DSPIN long wires (intra-cluster wires) were constrained to 300ps of propagation time. Moreover, low power standard cells with low Vt transistors were used in conjunction with clock-gating techniques to minimize the power consumption.

5.2. Floorplanning

The DSPIN routers are not built using hard macros. They are placed and routed as standard cell modules. This property gives the designer the flexibility to decide the best shape and position for the DSPIN router modules. Hence, the shape and position of the routers are constrained using regions. A region is a floorplanning delimiter that forces all the cells of a module to be placed inside the defined area. However, the region does not define an exclusive area, because cells of other modules can also be placed inside it. The DSPIN routers were designed using five regions, one for each module (North, South, East, West, and Local). The placement density for these regions was tuned to around 70%. Figure 10 shows the FAUST floorplan using DSPIN routers. The cluster areas are delimited by the colored rectangles, while the N, S, E, W, and L filled boxes denote the North, South, East, West, and Local DSPIN router modules respectively. The top left and bottom red boxes are reserved for the RAC and DART hard macros. These modules are not used by the Matrice application and were not implemented. Nevertheless, their area was reserved for a fair comparison.

[Figure 10 image: floor-plan showing each cluster with its N, S, E, W and L router modules, the ARM946, RAM1, RAM2 and external RAM controller blocks, and the reserved RAC and DART areas.]

Figure 10. FAUST floor-plan with DSPIN

5.3. Clock Tree

We have added a buffer or an inverter on the clock input of each DSPIN router for the mesochronous communications (Figure 6). These buffers/inverters are used to build the clock tree of the DSPIN routers while supporting the GALS approach (Figure 11). The clock tree implementation follows four steps:

1. The buffer/inverter on the clock input pin of each DSPIN router is manually placed in the middle of the area occupied by the cluster. Thus, the router clock-tree wires are as short as possible.
2. A clock tree is synthesized for each DSPIN router. The starting point of the clock tree is the buffer/inverter on the clock input pin of the router. Each clock tree is synthesized with a 5% skew target.
3. Each clock tree is characterized by its input delay, skew, and input capacitance.
4. A top clock tree is synthesized to balance the clock trees of all the DSPIN routers. Following the GALS approach, the top clock tree is balanced with a 30% skew, while the leaves have a 5% skew.

Figure 11. DSPIN clock tree

5.4. Mesochronous and Asynchronous Links

The communications between neighboring routers are mesochronous because the clock tree is not fully balanced, while the communications between a router and its subsystem are fully asynchronous because they use different clock frequencies. The bi-synchronous FIFO interfaces these communications without a complex back-end flow. The timing constraints file has to be properly defined to guarantee correct tool operation.

• For the asynchronous interfaces, a set_false_path constraint is set between clock signals of independent frequency. Hence, the tool understands the asynchronous nature of this kind of interface. Otherwise, the tool tries unsuccessfully to synchronize non-synchronous interfaces.
• For the mesochronous interfaces, a set_multicycle_path constraint is added on the output ports of the FIFO data registers. This constraint informs the tool that the content of a FIFO data register is not written and read in the same clock cycle. By construction, the reading of a bi-synchronous FIFO data register is delayed with respect to its writing by the synchronization latency [3]. Hence, the data is stable when it is read, the timing paths are less constrained, and the tool can easily handle the mesochronous interface.

6. Network Comparison

In this section, the ANOC and DSPIN implementations are compared in terms of area, throughput, latency, and power consumption, using a synthetic workload.

6.1. Area

The ANOC router was implemented as a hard macro. Its area is 0.21mm² with a cell density of 95%. The GALS interface module is implemented as a soft macro and its area is computed assuming a 95% cell density. DSPIN, on the other hand, is implemented as a soft macro and no area is exclusively reserved for the router. Assuming a 95% integration density, the total area is computed in Table 2, taking into consideration the DSPIN clock tree and the FIFO area of the GALS interfaces (ANOC requires 4 FIFOs while DSPIN requires 2). The total DSPIN area is 33% smaller than the ANOC area.

Table 2. ANOC and DSPIN router area comparison

| | ANOC | DSPIN |
|---|---|---|
| Router | 0.211 mm² | 0.161 mm² |
| Interface GALS | 0.070 mm² | 0.024 mm² |
| Clock tree | 0.000 mm² | 0.0016 mm² |
| Total | 0.281 mm² | 0.187 mm² |
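The totals and the 33% figure can be reproduced from the per-component areas of Table 2. The short script below is only a worked check of that arithmetic; the dictionary layout and variable names are illustrative.

```python
# Per-component areas from Table 2, in mm².
area_mm2 = {
    "ANOC": {"router": 0.211, "gals_interface": 0.070, "clock_tree": 0.000},
    "DSPIN": {"router": 0.161, "gals_interface": 0.024, "clock_tree": 0.0016},
}

# Sum the components of each NoC and compare the totals.
totals = {noc: sum(parts.values()) for noc, parts in area_mm2.items()}
reduction = 1 - totals["DSPIN"] / totals["ANOC"]

if __name__ == "__main__":
    for noc, total in totals.items():
        print(f"{noc}: {total:.3f} mm²")
    print(f"DSPIN is about {reduction:.0%} smaller than ANOC")
```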

6.2. Throughput

The throughput of the ANOC router depends on the fabrication process, on the supply voltage, and on the temperature. It is 160Mflit/s in worst-case conditions and 220Mflit/s in nominal conditions. Asynchronous circuits have the advantage of automatically adapting their performance to the process, temperature, and voltage of the circuit. The throughput of the DSPIN router depends on the clock frequency: it is one flit per clock cycle (1Mflit/s for a clock frequency of 1MHz). This gives 289Mflit/s in worst-case conditions and 408Mflit/s in nominal conditions. Table 3 shows the throughput comparison between the ANOC and DSPIN routers. In a real implementation, ANOC will operate at its nominal 220Mflit/s, while the DSPIN router should be clocked close to the worst-case frequency of 289MHz to improve the fabrication yield.

Table 3. Throughput comparison

| | ANOC | DSPIN |
|---|---|---|
| Throughput on worst-case conditions | ~ 160Mflit/s | ≤ 289Mflit/s |
| Throughput on nominal conditions | ~ 220Mflit/s | ≤ 408Mflit/s |

In terms of critical path analysis and cycle time for long-distance communications, the ANOC critical path crosses the long wires between ANOC routers four times, while the DSPIN critical path crosses them just once. This comes from the fact that ANOC uses a 4-phase QDI asynchronous protocol. Thus, the long-wire delay has four times more influence on the ANOC router than on the DSPIN router. Consequently, in deep submicron technologies, where the interconnect delays will be higher than the gate delays, a multi-synchronous architecture such as DSPIN would have a higher packet throughput than an asynchronous one such as ANOC. Fortunately, pipeline stages can be inserted on the long wires in order to cope with these delays and improve the throughput, at the cost of added latency.
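As a worked example of the figures in this section, the flit rates of Table 3 can be converted into payload bandwidth per link, and the operating points recommended in the text (ANOC at nominal conditions, DSPIN at worst-case conditions) can be compared. Counting only the 32 payload bits of each 34-bit flit is a choice made for this sketch, and the names are illustrative.

```python
PAYLOAD_BITS = 32  # payload carried by each 34-bit flit

# Flit rates from Table 3, in Mflit/s.
flit_rate = {
    "ANOC": {"worst": 160, "nominal": 220},
    "DSPIN": {"worst": 289, "nominal": 408},
}


def payload_gbit_s(mflit_per_s):
    """Payload bandwidth of one link in Gbit/s for a given flit rate."""
    return mflit_per_s * 1e6 * PAYLOAD_BITS / 1e9


# Operating points used in the text: ANOC nominal versus DSPIN worst case.
gain = flit_rate["DSPIN"]["worst"] / flit_rate["ANOC"]["nominal"] - 1

if __name__ == "__main__":
    print(f"ANOC nominal: {payload_gbit_s(220):.2f} Gbit/s per link")
    print(f"DSPIN worst case: {payload_gbit_s(289):.2f} Gbit/s per link")
    print(f"DSPIN advantage: {gain:.0%}")
```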

6.3. Packet Latency

The minimal packet latency is the end-to-end delay between the time a packet header enters the network and the time it exits the network, assuming no contention. This path can be decomposed into three parts: First, Intermediate, and Last latency [12]. The First latency is the time it takes the packet to cross the GALS interface and the first router. The Last latency is the time it takes the packet to cross the last router and the GALS interface. The Intermediate latency is the time it takes the packet to cross one intermediate router between the first and the last router, as shown in Figure 12.

Figure 12. Packet latency definition

The latency of the ANOC routers was measured on the real implementation of the FAUST circuit. The ANOC Intermediate router latency is 6.8ns and does not depend on the clock frequency. The DSPIN router latency, on the other hand, depends on the network and subsystem clock frequencies. Table 4 shows the latency comparison between ANOC and DSPIN when the subsystem is clocked at 150MHz or 250MHz. These frequencies are also used to clock the DSPIN routers.

Table 4. Latency comparison between ANOC and DSPIN routers

| | ANOC (F = 150 MHz) | DSPIN (F = 150 MHz) | ANOC (F = 250 MHz) | DSPIN (F = 250 MHz) |
|---|---|---|---|---|
| Intermediate router latency | 6.80 ns | 16.66 ns | 6.80 ns | 10.00 ns |
| First + Last latency | 60.00 ns | 56.66 ns | 47.00 ns | 34.00 ns |

The ANOC intermediate router latency is lower than the DSPIN one. This comes from the fact that the DSPIN routers resynchronize the data packets at each hop. To obtain the same intermediate router latency, the DSPIN router should be clocked at 367MHz. On the other hand, the first and last router latency is better optimized on the DSPIN side. Table 5 shows the latency of the ANOC and DSPIN routers for paths of 5 and 9 hops. The ANOC router clearly has a lower latency than the DSPIN router at low clock frequencies. However, the latencies become quite similar when the DSPIN clock frequency increases.

Table 5. Latency analysis for 5 and 9 hops path

| | ANOC (F = 150 MHz) | DSPIN (F = 150 MHz) | ANOC (F = 250 MHz) | DSPIN (F = 250 MHz) |
|---|---|---|---|---|
| Latency for 5 hops path | 80.00 ns | 106.66 ns | 68.00 ns | 64.00 ns |
| Latency for 9 hops path | 106.66 ns | 173.30 ns | 96.00 ns | 104.00 ns |
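The DSPIN columns of Tables 4 and 5 are consistent with a simple cycle-count model, sketched below. The cycle counts are not stated in the text: they are inferred from the table values (16.66 ns at 150 MHz and 10.00 ns at 250 MHz both correspond to 2.5 cycles, and 56.66 ns and 34.00 ns both correspond to 8.5 cycles). The model also assumes that a path of n hops crosses n - 2 intermediate routers. The measured ANOC values are not reproduced by this sketch.

```python
FIRST_LAST_CYCLES = 8.5    # inferred from Table 4 (First + Last latency)
INTERMEDIATE_CYCLES = 2.5  # inferred from Table 4 (Intermediate latency)


def dspin_latency_ns(hops, f_mhz):
    """Contention-free DSPIN latency for a path of `hops` routers (hops >= 2)."""
    if hops < 2:
        raise ValueError("the model needs at least a first and a last router")
    cycle_ns = 1000.0 / f_mhz
    return (FIRST_LAST_CYCLES + INTERMEDIATE_CYCLES * (hops - 2)) * cycle_ns


def matching_frequency_mhz(target_ns):
    """Clock frequency at which a DSPIN intermediate router takes `target_ns`."""
    return INTERMEDIATE_CYCLES / target_ns * 1000.0


if __name__ == "__main__":
    for f in (150, 250):
        for hops in (5, 9):
            print(f"{hops} hops at {f} MHz: {dspin_latency_ns(hops, f):.2f} ns")
    print(f"match 6.8 ns per hop at {matching_frequency_mhz(6.8):.1f} MHz")
```

The last line of output corresponds to the 367MHz figure quoted in the text for matching the ANOC intermediate router latency.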

6.4. Power Consumption

In order to analyze the power consumption of the NoC architectures, back-annotated gate-level simulations were performed on both architectures. The back-annotation data was extracted from the physically implemented architectures under typical conditions. Both architectures computed the same OFDM frame demodulation using real functional traffic in order to obtain accurate power consumption estimates. Table 6 shows the detailed power consumption analysis for both architectures. The GALS interface power corresponds to the power consumption of the 4 FIFOs of the GALS interface module of ANOC, and of the 2 FIFOs of the protocol_conversion module of DSPIN.

Table 6. Power consumption comparison

| | ANOC | DSPIN (F = 150 MHz) | DSPIN (F = 250 MHz) |
|---|---|---|---|
| Router | 2.07 mW | 2.89 mW | 4.85 mW |
| GALS interface | 1.62 mW | 0.56 mW | 0.81 mW |
| Clock tree | 0.00 mW | 2.44 mW | 4.73 mW |
| Total | 3.69 mW | 5.89 mW | 10.39 mW |

The power consumption of the ANOC router is lower than that of the DSPIN router, even at 150 MHz. This comes from the fact that DSPIN uses larger FIFOs (7 words deep, compared to 2 words deep in ANOC). On the other hand, the GALS interface of DSPIN is more efficient than the one of ANOC. Finally, DSPIN requires a clock tree that consumes as much power as the router itself. Consequently, the total power consumption of ANOC is at least 37% lower than that of DSPIN for the same application.
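The totals of Table 6 and the 37% figure can be checked with a few lines of arithmetic. The script below is only a worked verification of the published numbers; the dictionary keys are illustrative labels.

```python
# Per-component power from Table 6, in mW.
power_mw = {
    "ANOC": {"router": 2.07, "gals_interface": 1.62, "clock_tree": 0.00},
    "DSPIN @ 150 MHz": {"router": 2.89, "gals_interface": 0.56, "clock_tree": 2.44},
    "DSPIN @ 250 MHz": {"router": 4.85, "gals_interface": 0.81, "clock_tree": 4.73},
}

totals = {name: sum(parts.values()) for name, parts in power_mw.items()}

# Relative saving of ANOC with respect to each DSPIN operating point.
savings = {
    name: 1 - totals["ANOC"] / total
    for name, total in totals.items()
    if name != "ANOC"
}

if __name__ == "__main__":
    for name, total in totals.items():
        print(f"{name}: {total:.2f} mW")
    for name, saving in savings.items():
        print(f"ANOC consumes {saving:.0%} less than {name}")
```

The smallest saving is obtained against DSPIN at 150 MHz, which is where the "at least 37%" statement comes from.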

7. Conclusion

A physical implementation of the DSPIN network-on-chip on the generic, stream-oriented FAUST platform has been presented. The multi-million-gate FAUST architecture using the DSPIN network was physically implemented up to mask layout. The DSPIN architecture was adapted to manage stream-oriented communications. This adaptation was simple because both DSPIN and ANOC respect the OSI reference model. A dedicated wrapper was designed to adapt the ANOC packet format to the DSPIN format without modifying the network interface controllers defined by the FAUST architecture. We demonstrated that a network architecture designed to support shared-memory multi-threaded applications can efficiently support stream-oriented applications.

The DSPIN implementation has performance comparable to that of the ANOC implementation in terms of silicon area, throughput, latency, and power consumption. The area of DSPIN is 33% smaller than that of ANOC. The maximum sustained throughput of DSPIN is 31% higher than that of ANOC, considering that ANOC operates under nominal conditions and DSPIN under worst-case conditions. In terms of packet latency, DSPIN must be clocked at 367 MHz or higher to match the packet latency of the ANOC router. However, at that frequency, the power consumption of the DSPIN router is three times higher than that of the ANOC router. Therefore, ANOC is a good candidate for low-latency and low-power applications, while DSPIN is better suited to low-area and high-performance applications.

From a design-flow point of view, the multi-synchronous DSPIN network is implemented using only standard cells and a soft-macro design approach. It uses neither asynchronous nor custom cells, giving the designer complete flexibility to control the floorplan of the circuit. On the other hand, ANOC is an asynchronous network that requires additional standard cells, such as Muller gates, and dedicated synthesis tools. Therefore, to hide the complexity of the asynchronous logic, a hard-macro approach was used for the ANOC router design, which also helps top-level timing optimization. We demonstrated that the multi-synchronous DSPIN architecture can be implemented simply and automatically and, being fully synthesizable, can be ported directly to other CMOS process technologies.
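The latency and power figures quoted in the conclusion can be cross-checked with a rough first-order calculation. The sketch below is illustrative only and is not taken from the paper: it assumes that DSPIN router power scales linearly with clock frequency at a fixed supply voltage, that ANOC power is unchanged for the same traffic, that DSPIN latency is a fixed number of clock cycles, and that "three times higher" means a ratio of 3.0. The function and constant names are hypothetical.

```python
# First-order consistency check of the router power figures quoted above.
# ASSUMPTIONS (not from the paper): DSPIN router power scales linearly with
# clock frequency at a fixed supply voltage, ANOC power is unchanged for the
# same traffic, DSPIN latency is a fixed number of clock cycles, and
# "three times higher" is read as a ratio of 3.0.

F_MATCH_MHZ = 367.0          # DSPIN frequency needed to match ANOC packet latency
F_LOW_MHZ = 150.0            # lower DSPIN frequency mentioned in the power comparison
POWER_RATIO_AT_MATCH = 3.0   # DSPIN router power / ANOC router power at 367 MHz


def rescale_power_ratio(ratio: float, f_from_mhz: float, f_to_mhz: float) -> float:
    """Rescale a DSPIN/ANOC power ratio from one DSPIN clock frequency to another."""
    if f_from_mhz <= 0 or f_to_mhz <= 0:
        raise ValueError("frequencies must be positive")
    return ratio * (f_to_mhz / f_from_mhz)


def latency_penalty(f_match_mhz: float, f_actual_mhz: float) -> float:
    """Factor by which a fixed-cycle-count latency grows when the clock slows down."""
    if f_match_mhz <= 0 or f_actual_mhz <= 0:
        raise ValueError("frequencies must be positive")
    return f_match_mhz / f_actual_mhz


if __name__ == "__main__":
    ratio_150 = rescale_power_ratio(POWER_RATIO_AT_MATCH, F_MATCH_MHZ, F_LOW_MHZ)
    penalty_150 = latency_penalty(F_MATCH_MHZ, F_LOW_MHZ)
    print(f"Estimated DSPIN/ANOC router power ratio at 150 MHz: {ratio_150:.2f}")
    print(f"DSPIN latency penalty at 150 MHz relative to 367 MHz: {penalty_150:.2f}x")
```

Under these assumptions, the estimated power ratio at 150 MHz is about 1.2, which stays above 1 and is therefore consistent with the statement that the ANOC router consumes less power even at 150 MHz. The price of running DSPIN at that frequency would be a packet latency roughly 2.4 times longer than at 367 MHz. Both numbers are estimates that depend entirely on the linear-scaling assumption, not measured results.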

Acknowledgments

The authors would like to thank Didier Lattard, Yvain Thonnart, Edith Beigne, and Jean Durupt from CEA-Leti for their contributions to this work.

References

[1] S. Kaiser et al., “4G MC-CDMA Multi Antenna System on Chip for Radio Enhancements (4MORE),” Proc. of 13th IST Mobile and Wireless Communications Summit, Lyon, France, June 2004.

[2] I. Miro-Panades, A. Greiner, and A. Sheibanyrad, “A Low Cost Network-on-Chip with Guaranteed Service Well Suited to the GALS Approach,” 1st Int. Conf. on Nano-Networks and Workshops (Nano-Net 2006), September 2006.

[3] I. Miro-Panades and A. Greiner, “Bi-Synchronous FIFO for Synchronous Circuit Communication Well Suited for Network-on-Chip in GALS Architectures,” First Int. Symposium on Network-on-Chip (NOCS’07), pp. 83-94, Princeton, NJ, May 2007.

[4] I. Miro-Panades, “Buffer memory control device (Dispositif de commande d’une mémoire tampon),” Patent FR2899985, October 2007.

[5] I. Miro-Panades, “Control circuit for FIFO memory,” Patent pending.

[6] E. Beigne, F. Clermidy, P. Vivet, A. Clouard, and M. Renaudin, “An Asynchronous NoC Architecture Providing Low Latency Service and its Multi-Level Design Framework,” Proc. 11th Int. Symp. on Advanced Research in Asynchronous Circuits and Systems (ASYNC’2005), pp. 54-63, March 2005.

[7] E. Beigne and P. Vivet, “Design of On-chip and Off-chip Interfaces for a GALS NoC Architecture,” Proc. 12th Int. Symp. on Advanced Research in Asynchronous Circuits and Systems (ASYNC’2006), Grenoble, France, pp. 172-181, March 2006.

[8] D. Lattard et al., “A Telecom Baseband Circuit Based on an Asynchronous Network-on-Chip,” Proc. of Int. Solid State Circuits Conference (ISSCC’2007), San Francisco, USA, Feb. 2007.

[9] F. Berens et al., “Designing a multiple antenna MC-CDMA SoC for beyond 3G,” Embedded Systems, San Francisco, USA, March 2005.

[10] A. Andriahantenaina and A. Greiner, “Micro-network for SoC: Implementation of a 32-port SPIN network,” Design Automation and Test in Europe (DATE 2003), pp. 1128-1129, March 2003.

[11] S. Vangal et al., “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” ISSCC Dig. Tech. Papers, pp. 98-99, Feb. 2007.

[12] A. Sheibanyrad, I. Miro-Panades, and A. Greiner, “Systematic comparison between the asynchronous and the multi-synchronous implementations of a Network on Chip architecture,” Proc. IEEE Design, Automation and Test in Europe (DATE’07), April 2007.

[13] A. Andriahantenaina, “Physical implementation of a 32-port SPIN micro-network (Implémentation matérielle d’un micro-réseau SPIN à 32 ports),” PhD thesis, The University of Pierre et Marie Curie, France, Jan. 2006.

[14] A. M. Scott et al., “Asynchronous on-Chip Communication: Explorations on the Intel PXA27x Processor Peripheral,” 13th IEEE Int. Symp. on Asynchronous Circuits and Systems (ASYNC’07), 2007.

[15] P. Maurine, J.B. Rigaud, F. Bouesse, G. Sicard, and M. Renaudin, “Static Implementation of QDI Asynchronous Primitives,” 13th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS 2003), Torino, Italy, pp. 181-191, Sept. 2003.

[16] P. Guerrier and A. Greiner, “A generic architecture for on-chip packet-switched interconnections,” Proc. Design Automation and Test in Europe (DATE’00), pp. 250-256, March 2000.

[17] A. Pullini et al., “NoC Design and Implementation in 65nm Technology,” First International Symposium on Networks-on-Chip (NOCS 2007), pp. 273-282, Princeton, NJ, 7-9 May 2007.