FAULT-TOLERANT XGFT NETWORK-ON-CHIP FOR MULTI-PROCESSOR SYSTEM-ON-CHIP CIRCUITS

Heikki Kariniemi, Jari Nurmi
Institute of Digital and Computer Systems, Tampere University of Technology
P.O.BOX 553, 33101 Tampere, Finland
email: [email protected], [email protected]

and the complexity of the hardware and allows more complete testing of the SoCs than before [6, 7]. As the NOCs will be implemented as IP blocks, their testing and diagnosis can also be performed with software. The testing of the SoCs can be performed hierarchically in such a way that individual IP blocks are tested separately by processors running test software which implements functional-level tests [8, 9]. This will also be necessary because, as the systems grow larger, gate-level testing becomes practically impossible due to long test times. In addition, because the SoCs will contain several clock domains, it might be difficult or even impossible to test them at full operation speed with traditional scan-based test methods. Functional-level testing usually improves the quality of testing while it may also produce shorter test times. This is partially because it allows testing with true test cases at full operation speed, which makes it possible to find delay faults in addition to stuck-at faults.

General issues related to the fault-tolerance of computer networks have been presented quite thoroughly in [10]. Fault-tolerance is the ability of the network to operate correctly even when switch nodes and links are defective. To be fault-tolerant, the NOCs as well as the computer networks must implement recovery mechanisms for faults, because otherwise their operation will probably be blocked, e.g. by corrupted packets or by broken links. Faults can be classified into static, dynamic, and transient faults. Static faults are permanent faults caused by e.g. defective circuits. Dynamic faults are also permanent but they appear randomly during the operation. The permanent faults, which are actually structural faults, can be eliminated from the system with a suitable repair method which reconfigures the system to avoid using the faulty parts. Transient faults, which occur randomly, are not permanent. They may be caused by e.g. alpha particles. They can be detected by simple parity bit checks and eliminated by removing or correcting the corrupted packets.

Different methods have been developed for implementing fault-tolerant routing with wormhole routing technique

ABSTRACT

This paper presents a fault-tolerant eXtended Generalized Fat Tree (XGFT) Network-On-Chip (NOC) implemented with a new fault-diagnosis-and-repair (FDAR) system. The FDAR system is able to locate faults and reconfigure switch nodes in such a way that the network can route packets correctly despite the faults. This paper presents how the FDAR finds the faults and reconfigures the switches. Simulation results are used for showing that faulty XGFTs can also achieve good performance if the FDAR is used. This is possible if deterministic routing is used in faulty parts of the XGFTs and adaptive Turn-Back (TB) routing is used in faultless parts of the network for ensuring good performance and Quality-of-Service (QoS). The XGFT is also equipped with parity bit checks for detecting bit errors in the packets.

1. INTRODUCTION

The functions of the new System-on-Chip (SoC) circuits will be increasingly implemented with several programmable processors and reconfigurable logic blocks. These Multi-Processor SoCs (MPSoCs) are actually design platforms [1, 2], which can be modified for several different purposes by programming. In a wider sense the design platforms also include design methods in addition to the system architectures and the intellectual property (IP) blocks used in the MPSoC circuits [2, 3]. The usage of design platforms is a consequence of the demand for shorter design times. The design platforms can be used together with e.g. a processor-centric design flow [3] which generates application-specific processors optimized for running some particular applications. Their communication infrastructures will resemble current computer networks, and software processes will use advanced communication protocols for interprocess communication [4, 5]. As a natural consequence of the increased amount of software, the logic cores of the future MPSoCs will also be tested and diagnosed by software, which reduces the size

0-7803-9362-7/05/$20.00 ©2005 IEEE

[Figure 1 graphic, panels A and B. Recoverable labels: stage numbers 1 to 4, MANAGEMENT INTERFACE, SYSTEM MANAGEMENT, sub-XGFT letters A, B, C, and D on the switches, and leaf-node numbers 0, 63 and 0, 31.]

Fig. 1. XGFT(3,4,4,4,4,1,1) (A) and XGFT(3,4,4,4,3,2,1) (B).

which is commonly used in computer networks owing to its small routing latency. One of them is the misrouting backtracking with m misroutes (MB-m) protocol used with pipelined circuit switching (PCS) [11], which is a variant of the wormhole routing mechanism. In MB-m the first flow control unit (flit), i.e. the header of the packet, is used for probing the network ahead of the packet. If the first flit is blocked and cannot be routed forward, it is routed backward along the same path to the previous node and routed forward again along an alternative path. Finally, if there are any faultless routing paths between the source and destination nodes, they are found and the packet can be routed to its destination. The drawback of the MB-m is that the architecture and operation of the switch nodes become complex.

Another suitable choice, which could also be adapted for the XGFTs, is fault-tolerant compressionless routing (FCR) [12] used with wormhole routing. In FCR all of the transmitted packets are as long as the routing path from the source to the destination. When the packet header arrives at the destination, the padding at the end of the packet is removed and only the actual payload is transferred to the destination. If the packet is blocked by another packet or a fault during the transmission, the packet source detects this as a stop of transmission, starts its timer, and starts to wait. If the waiting lasts long enough, it sends a kill-signal (FKILL) to the network in order to remove the packet. The FCR actually implements a simple on-line diagnosis algorithm which is also used together with some repair method [12].

In XGFTs [13, 14] fault diagnosis must be used to locate faults, because there is only one routing path available in the XGFTs for routing the packet downwards from the nearest common ancestor to its destination. If there is a fault in this path and the packet is blocked, it blocks other packets and finally the whole network may be blocked.

This paper presents a new fault-diagnosis-and-repair (FDAR) system which improves the fault-tolerance of the XGFTs. The operation of the FDAR resembles that of the FCR when it diagnoses the NOC. If it finds a fault, it reconfigures the faulty switch node in such a way that the fault does not impede the operation of the NOC any more. After this it configures the routing tables of the processors accordingly so that the packets are routed around the faulty part of the network. In the XGFTs the fault-tolerant routing can be implemented simply by first using the FDAR for finding the faults. After this, deterministic routing is used in the faulty parts of the network for routing packets around faulty switches and links, while adaptive routing can be used in faultless parts of the XGFT. The FDAR allows the XGFTs to be diagnosed also later during the operation of the system.

This paper presents the operation and the implementation of the FDAR. Although the FDAR will be partially implemented with software, the focus of this presentation is on the hardware, i.e. the switch, which must implement simple mechanisms for detecting the faults and for enabling the repair by reconfiguration. This paper is organized as follows. Section 2 presents the topology of the XGFTs from the point of view of fault-tolerance. Section 3 presents the operation of the adaptive Turn-Back (TB) and deterministic routing in XGFTs. Section 4 presents the operation of the FDAR system and its implementation. Section 5 presents simulation and synthesis results which are used for justifying the usability of the FDAR for improving the fault-tolerance of the XGFTs. Finally, Section 6 concludes this paper briefly.

2. THE FAULT-TOLERANT XGFT

2.1. The XGFT Topology with Redundant Resources

The XGFT topology can be defined as a tuple XGFT($h, m_1, \ldots, m_h, w_1, \ldots, w_h$), where $h$ is the number of switch stages, parameter $m_i$ is the number of child nodes of the switch nodes in stage $i$, and parameter $w_i$ is the number of their parent nodes [13, 14]. Switches in stage one, which is the lowest stage, are connected to processor leaf nodes. Because the degree of switches on different stages of the XGFTs can vary, the XGFTs are more scalable for different system sizes and performance requirements than Fat Trees. This is illustrated in Fig. 1A and 1B where large white squares are switches and small black

[Figure 2 graphic, panels A to D. Recoverable labels:
A: switch ports PUR[0..2], PDR[0..2], CUR[0..3], CDR[0..3]; port blocks IP0 to IP6 and OP0 to OP6; link widths 3×(32+4) bits and 4×(32+4) bits.
B: channel signals Data bits (32), Parity bits (4), VRC bits (9), Req, Ack, Blocked, Remove, Op_Enable.
C: input port blocks Rx, IP-FIFO, BE-FIFO, IP-CTRL, IP-BIST-CTRL; signals Request (K), Grant (K), Accept (K), Reserve (K), Blocked (K), Remove, Push-HS, Pop-HS, Ip_Enable.
D: output port blocks OP-MUX, OP-FIFO, Tx, OP-CTRL, OP-BIST-CTRL; signals Request (N), Grant (N), Accept (N), Reserve (N), Remove (N), Blocked, Push-HS, Pop-HS, Op_Enable.
N = the number of input ports, K = the number of output ports.]

Fig. 2. The structure of the 3x4-switch (A), the structure of the communication links (B), and the structure of the input ports (C) and the output ports (D).

squares are processor nodes. The stages are numbered in descending order starting from the top of the XGFT, and the leaf nodes are numbered in ascending order starting from the leftmost leaf node. For example, XGFT(3,4,4,4,4,1,1) has more routing resources in its four sub-XGFT(2,4,4,4,1)s than XGFT(3,4,4,4,3,2,1) has in its four sub-XGFT(2,4,4,3,2)s. Therefore, it is able to achieve higher throughputs with local cluster traffic. Fig. 1A and 1B also depict how management interfaces, which are actually also switch nodes, could be integrated into different XGFTs.

XGFTs have a lot of redundant routing resources, namely switches and bi-directional transmission links, which can be utilized in improving the fault-tolerance of the XGFTs. Fig. 1B illustrates how the fault-tolerance of the system could be improved by using two network interfaces at each of the leaf nodes. Switches in stages one and two have been labeled with letters A, B, C, and D in order to show which sub-XGFT of height two they belong to. In this interconnection scheme all of the interfaces have different addresses, and it would be possible to transfer packets between any pair of leaf nodes even if, e.g., all of the switches and links of sub-XGFTs A and C were defective. Furthermore, if any of the switches of stages three or two of the XGFT(3,4,4,4,4,1,1) were defective, there would still be completely functioning switches and links in the network for communication between any pair of processor leaf nodes. This interconnection scheme also doubles the peak bandwidth usable by every processor, even if there were two processors in every leaf node.
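The relation between an XGFT and its sub-XGFTs used in the example above can be illustrated with a short Python sketch. It assumes the flat tuple notation of this paper (the stage count, then the child counts, then the parent counts) and assumes that a sub-XGFT of height h-1 is obtained by dropping the parameters of the topmost stage, with one sub-XGFT per child of a topmost switch. The names `XGFT`, `parse`, `sub_xgft`, and `flat` are hypothetical helpers introduced only for this illustration.

```python
from typing import NamedTuple, Tuple


class XGFT(NamedTuple):
    h: int                # number of switch stages
    m: Tuple[int, ...]    # m[i-1]: children of a stage-i switch
    w: Tuple[int, ...]    # w[i-1]: parents of a stage-i switch


def parse(*params: int) -> XGFT:
    """Build an XGFT from the flat notation XGFT(h, m_1..m_h, w_1..w_h)."""
    h = params[0]
    if len(params) != 1 + 2 * h:
        raise ValueError("expected h followed by h child and h parent counts")
    return XGFT(h, tuple(params[1:1 + h]), tuple(params[1 + h:]))


def sub_xgft(t: XGFT) -> Tuple[int, XGFT]:
    """Return (number of sub-XGFTs, sub-XGFT of height h-1).

    Assumption: dropping the topmost stage leaves m_h identical
    sub-XGFTs, one below each child port of a topmost switch.
    """
    if t.h < 2:
        raise ValueError("an XGFT of height 1 has no sub-XGFT")
    return t.m[-1], XGFT(t.h - 1, t.m[:-1], t.w[:-1])


def flat(t: XGFT) -> Tuple[int, ...]:
    """Convert back to the flat notation used in the text."""
    return (t.h, *t.m, *t.w)


count_a, sub_a = sub_xgft(parse(3, 4, 4, 4, 4, 1, 1))
count_b, sub_b = sub_xgft(parse(3, 4, 4, 4, 3, 2, 1))
print(count_a, flat(sub_a))
print(count_b, flat(sub_b))
```

Under these assumptions the sketch reproduces the four sub-XGFT(2,4,4,4,1)s and the four sub-XGFT(2,4,4,3,2)s named in the text.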

2.2. Switch Nodes with Distributed Architecture

The switch nodes have a modular distributed architecture scalable for different degrees. Fig. 2A depicts a 3x4-switch node which can be connected to three parent nodes and four child nodes. Every input port (IP) and output port (OP) block forms one unidirectional PUR-, PDR-, CUR-, or CDR-port. A pair of one unidirectional PUR-port and one PDR-port forms one bi-directional P-port which connects the switch node to one of its parent nodes in the next upper switch stage. Respectively, a pair of one unidirectional CUR-port and one CDR-port forms one bi-directional C-port which connects the switch node to one of its child nodes in the next lower switch stage. The communication between switch nodes is performed across the bi-directional links depicted in Fig. 2B. These links consist of two unidirectional channels which connect the PUR-ports and the CDR-ports of one switch to the CUR-ports and the PDR-ports of another switch, respectively.

The ports and the channels consist of several signals. In the synthesized network configurations the flits consist of 32 data bits and 4 parity bits, i.e. every flit carries one parity bit per data byte of eight bits. Packet sources compute the parity bits and switch nodes use them for detecting bit errors. It is also possible to use nine bits wide vertical redundancy checks (VRC) [15] where every bit is an even parity of four bits of the flit. The VRCs combined with the flits actually form a linear (45,36) block code. They could also be used for correcting one bit error in the flits. The request (Req) and acknowledge (Ack) signals of the channels are used for controlling the transfer of the flits over the channels. The remaining Blocked, Remove, and Op_Enable signals are used by the FDAR system.

Fig. 2C depicts the structure of the input port (IP) blocks and Fig. 2D depicts the structure of the output port (OP) blocks. The port blocks contain a link controller for receiving (Rx) or transmitting (Tx) the flits, and a FIFO (IP-FIFO, OP-FIFO) for storing the flits. The Tx blocks generate the VRCs and the Rx blocks check them. The results of the checks are also stored in the IP-FIFO together with the respective flits. The input port controllers (IP-CTRL) and output port controllers (OP-CTRL) control the routing and the switching of the packets through the switch.

The switch nodes have a distributed architecture which does not contain a centralized arbiter, but the IP-CTRL and the OP-CTRL blocks contain small input and output arbiters which implement a prioritized iterative SLIP (iSLIP) [16] arbiter. They use Request, Grant, Accept, and Reserve signals for arbitration. The switch nodes do not contain a separate crossbar for switching the packets either, but the output port blocks contain output multiplexers (OP-MUX) for switching flits transmitted from different input ports. After the arbitration has been finished, the IP-CTRL block controls the transfer of the packets from the input ports to the output ports with Push-HS signals. Before the arbitration the IP-CTRL blocks also check the correctness of the four parity bits and the VRCs, and if they find a bit error in the packet header, the packet is removed from the network, which is done by transmitting it to a null output port. The IP-CTRL and OP-CTRL blocks also implement the FDAR system, which uses the Blocked and Remove signals.

Fig. 2C and Fig. 2D also depict built-in self-test (BIST) controllers (IP-BIST-CTRL, OP-BIST-CTRL) drawn with dashed lines. The BIST is performed on the channels immediately after the reset of the system. The OP-BIST-CTRL block generates test flits with a linear-feedback shift register (LFSR), and these are transferred through the OP-FIFO block, the channel, and the IP-FIFO block. At the same time the IP-BIST-CTRL generates the same test flits with another LFSR and compares them with the received ones. The BIST fails if the number of errors exceeds some predefined limit. In this case the channel remains disabled. Otherwise, it is enabled by assigning value '1' to the Ip_Enable and Op_Enable signals. The LFSRs produce 18 bits wide test data which are used to produce 36 bits wide test flits. The OP-BIST-CTRL does not generate separate test data for testing the VRC, but the IP-BIST-CTRL blocks also count the VRC errors as BIST errors.
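The flit protection described in this section (one even parity bit per data byte plus a nine-bit VRC) can be illustrated with a minimal Python sketch. The arrangement of the 36 flit bits is an assumption made for illustration: the flit is treated as four 9-bit rows (parity bit plus data byte), and each VRC bit is the even parity of one column of four bits. The source does not state the actual bit grouping, and the function names are hypothetical.

```python
from typing import List, Tuple


def odd(x: int) -> int:
    """Return 1 if x has an odd number of '1' bits, else 0."""
    return bin(x).count("1") & 1


def make_flit(word: int) -> List[int]:
    """Split a 32-bit word into four 9-bit rows: bit 8 is the even
    parity bit of the data byte held in bits 7..0 (assumed layout)."""
    rows = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        rows.append((odd(byte) << 8) | byte)
    return rows


def vrc(rows: List[int]) -> int:
    """Nine-bit VRC: each bit is the even parity of one column of four bits."""
    code = 0
    for row in rows:
        code ^= row
    return code


def check_flit(rows: List[int], received_vrc: int) -> Tuple[List[int], str]:
    """Check a flit and correct a single bit error where possible."""
    bad_rows = [i for i, row in enumerate(rows) if odd(row)]
    syndrome = vrc(rows) ^ received_vrc          # columns that disagree
    weight = bin(syndrome).count("1")
    if not bad_rows and syndrome == 0:
        return list(rows), "ok"
    if len(bad_rows) == 1 and weight == 1:
        fixed = list(rows)
        fixed[bad_rows[0]] ^= syndrome           # flip the located bit
        return fixed, "corrected"
    if not bad_rows and weight == 1:
        return list(rows), "vrc_bit_error"       # flit itself is intact
    return list(rows), "uncorrectable"
```

With this layout a single flipped flit bit makes exactly one row parity and one column parity fail, which locates the bit. This is why the 36 flit bits plus 9 VRC bits can act as a single-error-correcting (45,36) code. In the presented switch a header with bit errors is simply removed, so the correction branch only illustrates the option mentioned in the text.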

[Figure 3 graphic. Recoverable labels:
Header fields: MODE, PRIOR., HEIGHT, DEST. ADDR., SOURCE ADDR., OTHER FIELDS, LENGTH, OTHER FIELDS
Body: 4 TO 240 WORDS OF DATA (NORMAL MODE) OR PADDING (DIAGNOSIS MODE)
Trailer: CRC & THREE END-DELIMITERS
OPERATION MODES:
"11" = ADAPTIVE ROUTING
"10" = DETERMINISTIC ROUTING
"01" = DETERMINISTIC ROUTING & DIAGNOSIS (PROBE)
"00" = DETERMINISTIC ROUTING & DIAGNOSIS (REPAIR)
END-DELIMITER: 1 0 0 0 0 0 0 0 0]

Fig. 3. The structure of the packets and the operation modes of the network.
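The MODE encoding and the end-delimiter of Fig. 3 can be sketched in Python as follows. The sketch assumes that a flit is handled as four 9-bit symbols (parity bit in bit 8, data byte in bits 7..0) and that the three end-delimiters occupy three symbol positions of one trailer flit. Their actual positions are not given in the text, so they are passed as a parameter. The names `Mode`, `is_diagnosis`, and `is_trailer` are hypothetical.

```python
from enum import Enum
from typing import Sequence


class Mode(Enum):
    REPAIR = 0b00          # deterministic routing & diagnosis (repair)
    PROBE = 0b01           # deterministic routing & diagnosis (probe)
    DETERMINISTIC = 0b10   # deterministic routing
    ADAPTIVE = 0b11        # adaptive Turn-Back routing


# Parity bit '1' followed by eight '0' data bits. A normal all-zero data
# byte has even parity (0x000), so the two cannot be confused.
END_DELIMITER = 0x100


def is_diagnosis(mode: Mode) -> bool:
    """True for the two diagnosis modes ("01" and "00")."""
    return mode in (Mode.PROBE, Mode.REPAIR)


def is_trailer(symbols: Sequence[int],
               positions: Sequence[int] = (0, 1, 2)) -> bool:
    """Detect the trailer when at least two of the three expected
    end-delimiters are present (positions are an assumption)."""
    hits = sum(1 for p in positions if symbols[p] == END_DELIMITER)
    return hits >= 2
```

Accepting two matches out of three lets a switch find the trailer of a packet even when one delimiter or the LENGTH field has been corrupted.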

3. ADAPTIVE TURN-BACK ROUTING AND DETERMINISTIC ROUTING IN XGFTS

In the XGFTs the routing of the packets consists of two phases. In the first phase packets are routed upwards in the network until they arrive at the nearest common ancestor of the source and destination leaf nodes. In the second phase they are routed downwards from this switch node. In the first phase it is possible to route packets upwards adaptively along several alternative routing paths. Because there is only one path available for routing the packets downwards from the nearest common ancestor, packets must be routed downwards deterministically according to their destination addresses. Deterministic routing could also be used for routing the packets upwards in the XGFT and around faulty parts of the XGFT, if the locations of the faults are known. This requires that the faults are first found and located by diagnosing the network. Therefore, fault-tolerant XGFTs must be implemented with both deterministic and adaptive routing, and a suitable diagnosis method like the FDAR. Deterministic routing will also be used in the presented XGFT for communication between processor nodes and the system management.

The structure of the packets is depicted in Fig. 3. The packets consist of a header of three 32 bits wide words, variable-length data of 4 to 240 words, and a trailer which carries a cyclic redundancy check (CRC) [15] and three end-delimiters. The header contains MODE, PRIOR, HEIGHT, DEST ADDR, and SOURCE ADDR fields used for controlling the routing and the transfer of the packets through the network. Other fields (OTHER FIELDS) are used for message flow control. The MODE-field controls the operation of the input port controllers which perform routing and arbitration. It determines how the packet is routed through the network. In "11"-mode normal adaptive TB routing is used. In "10"-mode packets are routed deterministically through the network. In "01"-mode packets are routed deterministically and they are used for diagnosing the network, and packets routed in "00"-mode are used for reconfiguring the network. The HEIGHT field is used in deterministic routing for determining the height at which the packet will be turned back and routed downwards to its destination. In adaptive Turn-Back (TB) routing the SOURCE ADDR and DEST ADDR fields contain encoded source and destination addresses [14]. In deterministic routing the SOURCE ADDR field contains the routing path upwards and the DEST ADDR field the routing path downwards.

Fig. 3 also depicts the end-delimiter, which consists of one byte of '0'-bits. Packet sources set the parity bits of the end-delimiters to value '1', which makes the end-delimiters distinguishable from other data bytes which may also carry eight '0'-bits but have even parity. The odd parity bit and the eight '0'-bits together form one end-delimiter.

The adaptive TB routing algorithm has a simple implementation suitable for the small switches of the NOCs. This has been achieved with address space encoding [14]. The encoded addresses consist of the output CDR-port numbers of the switch nodes along the routing paths which start at any of the topmost root switch nodes of the XGFT and end at the leaf node whose address is encoded. When the TB routes the packets upwards, it compares the most significant numbers of the encoded source (SOURCE ADDR. $s_h s_{h-1} \ldots s_{L+1}$) and destination (DEST. ADDR. $d_h d_{h-1} \ldots d_{L+1}$) addresses. If they are not equal, it routes packets further upwards through any of the free output PUR-ports as Fig. 4A illustrates. Otherwise it routes them downwards through the output CDR-port whose number it cuts from the encoded destination address. From the topmost stage the TB always routes packets downwards.

Packets can be routed upwards deterministically by specifying the up-routing path with a sequence (SOURCE ADDR.) of output PUR-port numbers of the switches along the path, as Fig. 4B illustrates. In deterministic routing packets are routed upwards through PUR-ports until they arrive at the stage from which packets are routed downwards. The height of this stage is determined by the HEIGHT-field. In both the adaptive TB routing and the deterministic routing, packets are routed downwards deterministically as Fig. 4C illustrates. In the deterministic down-routing packets are routed downwards through CDR-ports. The up-routing of the packets is performed by the CUR-port blocks and the down-routing by the PDR-port blocks of the switches.

Both of the routing algorithms test the correctness of the flits. If the packet header contains bit errors (parity or VRC errors), the packet is removed, which ensures that transient errors do not disrupt the operation of the network. This is because the packet headers are used for controlling the routing and switching of the packets through the switches, and if their fields contained bit errors, packets would most probably be routed incorrectly or blocked. When the packets are removed, the switches can still find the packet trailers by finding the end-delimiters even if the LENGTH-fields are corrupted. If the IP-CTRL finds at least two of the three end-delimiters simultaneously, it has found the trailer. The removed packets must be retransmitted later.
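The routing decisions of this section (and of Fig. 4A to 4C) can be sketched in Python as follows. The data layout is an assumption made for illustration: addresses are tuples whose element `L-1` holds the digit for stage `L`, port masks are lists of booleans, and "any unmasked PUR-port" is resolved by taking the first one, since arbitration and port availability are not modeled. The names `Switch`, `route_down`, `up_route_turn_back`, and `up_route_deterministic` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Action = Tuple[str, Optional[int]]  # ("up" | "down" | "remove", port number)


@dataclass
class Switch:
    stage: int               # L, counted from 1 at the lowest switch stage
    h: int                   # number of switch stages
    pur_masked: List[bool]   # one flag per up-routing output port
    cdr_masked: List[bool]   # one flag per down-routing output port


def route_down(sw: Switch, dest: Sequence[int], header_ok: bool) -> Action:
    """Deterministic down-routing through CDR[d_L] (Fig. 4C)."""
    port = dest[sw.stage - 1]
    if header_ok and not sw.cdr_masked[port]:
        return ("down", port)
    return ("remove", None)


def up_route_turn_back(sw: Switch, src: Sequence[int],
                       dest: Sequence[int], header_ok: bool) -> Action:
    """Adaptive Turn-Back up-routing (Fig. 4A)."""
    L = sw.stage
    # Turn back at the topmost stage or when d_h..d_(L+1) == s_h..s_(L+1).
    if L == sw.h or tuple(dest[L:]) == tuple(src[L:]):
        return route_down(sw, dest, header_ok)
    unmasked = [p for p, masked in enumerate(sw.pur_masked) if not masked]
    if header_ok and unmasked:
        return ("up", unmasked[0])   # stand-in for "any unmasked port"
    return ("remove", None)


def up_route_deterministic(sw: Switch, up_path: Sequence[int],
                           dest: Sequence[int], height: int,
                           header_ok: bool) -> Action:
    """Deterministic up-routing through PUR[s_L] (Fig. 4B)."""
    L = sw.stage
    if L == height:
        return route_down(sw, dest, header_ok)
    port = up_path[L - 1]
    if header_ok and not sw.pur_masked[port]:
        return ("up", port)
    return ("remove", None)
```

For example, a stage-1 switch with `pur_masked=[True, False, False]` sends a Turn-Back packet upwards through port 1, while a deterministic packet whose path names the masked port 0 is removed.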

4. THE FAULT-DIAGNOSIS-AND-REPAIR (FDAR) SYSTEM

4.1. The Operation of the FDAR

The new FDAR system assumes that faults block the packets. If a packet can be routed through the network, the fault is not considered serious enough to require repair. The FDAR eliminates the faults which block the packets by reconfiguring the switches. The FDAR system will be partially implemented with diagnosis software run by the processors. It performs functional-level tests on the network in two phases while it diagnoses its operation. In the first phase the processors probe the network in order to find and locate permanent static faults, which they do with probe-packets transmitted in "01"-mode. The probe-packets are as long as the size of all of the buffers along the routing path from the source to the destination. They consist of the header, the trailer, and a padding which may carry test data for testing the links. Their total length must be at least a number of flits that depends on HEIGHT and on the buffer sizes (the expression is missing from the source text), where parameter HEIGHT is the height of the stage where the packet is turned back. The other parameters are the storage capacities of the input (IP-FIFO) and output (OP-FIFO) buffers in flits, respectively, and the size of the input port back-end FIFO (BE-FIFO). Owing to the usage of the wormhole routing technique, the switches were implemented with small IP-FIFOs and OP-FIFOs of only eight flits. The length of the probe-packets must also be incremented by the number of the channels multiplied by two, because the link controllers contain registers for storing the flits.

All of the perfectly functioning processor nodes diagnose all of the possible routing paths to all the other processor nodes one after another. The diagnosis is performed under the control of the centralized system management so that the probe-packets transmitted from different processors do not block each other and cause confusion. The processors do not send new packets to the network before the previous transmission has been completed. If the whole network is faultless, the processors can use adaptive TB routing. Otherwise, they must also use deterministic routing side by side with the TB routing. For example, if only one of the sub-XGFTs of the XGFT contains faulty switches or channels, all of the packets transmitted to that sub-XGFT and within it should be routed with deterministic routing, while packets transmitted to all the other sub-XGFTs can be routed with the adaptive TB routing.

As the packet sources diagnose the network, they determine both the up-routing paths and the down-routing paths before they transmit the probe-packets to the network. They use a timer for counting the blocking times and are able to locate the faults if the probe-packets are blocked somewhere in the network for a longer time and the Blocked-signals are asserted. If faults are found, the sources reconfigure the network by transmitting repair-packets in "00"-mode along the same paths as the probe-packets to the same destinations. The repair-packets must be at least a minimum number of flits long (the value is missing from the source text) so that they occupy the buffer space of at least two successive input and output port blocks and so that the IP-CTRL blocks are able to notice their blocking. The last IP-CTRL blocks along the routing paths remove the blocked packets after their timers have launched timeouts. Alternatively, the sources can remove the repair-packets and reconfigure the network by asserting Remove-signals if their timers launch timeouts and long enough repair-packets are used. As processors find faulty paths, they store information about them in the routing tables so that they can avoid using the same routing paths later.

The operation of the FDAR is described in Fig. 4D. This

A: PART 1. UP-ROUTING IN STAGE L (L = 1, ..., h): (TURN-BACK)
   IF "switch is in the top-most stage, i.e. L = h" THEN
      IF "CDR[d_L]-port is unmasked" AND NOT "parity (or VRC) errors" THEN
         "Route packet downwards through CDR[d_L]-port";
      ELSE
         "Remove packet";
      END IF;
   ELSIF "d_h d_(h-1) ... d_(L+1) = s_h s_(h-1) ... s_(L+1)" THEN
      IF "CDR[d_L]-port is unmasked" AND NOT "parity (or VRC) errors" THEN
         "Route packet downwards through CDR[d_L]-port";
      ELSE
         "Remove packet";
      END IF;
   ELSE
      IF "Not all PUR-ports are masked" AND NOT "parity (or VRC) errors" THEN
         "Route packet upwards through any of the unmasked PUR-ports";
      ELSE
         "Remove packet";
      END IF;
   END IF;

B: PART 1. UP-ROUTING IN STAGE L (L = 1, ..., h): (DETERM.)
   IF "switch is in stage L = Height" THEN
      IF "CDR[d_L]-port is unmasked" AND NOT "parity (or VRC) errors" THEN
         "Route packet downwards through CDR[d_L]-port";
      ELSE
         "Remove packet";
      END IF;
   ELSE
      IF "PUR[s_L]-port is unmasked" AND NOT "parity (or VRC) errors" THEN
         "Route packet upwards through PUR[s_L]-port";
      ELSE
         "Remove packet";
      END IF;
   END IF;

C: PART 2. DOWN-ROUTING IN STAGE L (L = 1, ..., h): (TURN-BACK & DETERMINISTIC)
   IF "CDR[d_L]-port is unmasked" AND NOT "parity (or VRC) errors" THEN
      "Route packet downwards through CDR[d_L]-port";
   ELSE
      "Remove packet";
   END IF;

D: THE OPERATION OF THE FDAR
   IF "Mode is "11" or "10"" THEN                 -- "NORMAL MODE"
      IF "Counter > TIMEOUT_NORMAL" AND "Blocked = '0'" THEN
         "Remove packet";
      ELSE
         "Continue routing packet normally forward";
      END IF;
   ELSIF "Mode is "01" or "00"" THEN              -- "DIAG. MODE"
      IF "Counter > TIMEOUT_DIAG" OR "Remove = '1'" THEN
         IF "Blocked = '0'" THEN
            IF "Mode is "00"" THEN
               "Reconfigure switch and remove packet";
            ELSE
               "Remove packet";
            END IF;
         ELSE
            "Continue routing packet normally forward";
         END IF;
      END IF;
   END IF;
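The decision logic of panel D can be written as a small Python function, shown here as an illustrative sketch rather than the switch implementation. The timeout values are hypothetical; the text states only that the normal-mode timeout is smaller than the diagnosis-mode timeout and that the counters are ten bits wide. The case that panel D leaves without an action (diagnosis mode with no timeout and no Remove-signal) is assumed to mean that routing continues. `mask_port` models the reconfiguration as setting one bit of a port mask register.

```python
TIMEOUT_NORMAL = 200   # hypothetical value, must be below TIMEOUT_DIAG
TIMEOUT_DIAG = 800     # hypothetical value, fits a ten-bit counter


def fdar_step(mode: str, counter: int, blocked: int, remove: int) -> str:
    """Return the action of an IP-CTRL block for one blocked-packet check.

    mode    -- "11", "10" (normal) or "01", "00" (diagnosis)
    counter -- current value of the timer counter
    blocked -- input Blocked-signal from the next switch on the path
    remove  -- input Remove-signal
    """
    if mode in ("11", "10"):                      # normal mode
        if counter > TIMEOUT_NORMAL and blocked == 0:
            return "remove"
        return "continue"
    if mode in ("01", "00"):                      # diagnosis mode
        if counter > TIMEOUT_DIAG or remove == 1:
            if blocked == 0:                      # last blocked IP-CTRL on the path
                if mode == "00":
                    return "reconfigure_and_remove"
                return "remove"
            return "continue"
        return "continue"                         # assumption: no action specified
    raise ValueError("unknown mode: " + mode)


def mask_port(mask_register: int, port: int) -> int:
    """Mark an output port as faulty by setting its bit in the mask."""
    return mask_register | (1 << port)
```

For example, `fdar_step("00", 0, 0, 1)` yields `"reconfigure_and_remove"`, after which the switch would call something like `mask_port(mask, port)` for the output port that the repair-packet had requested.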

Fig. 4. Up-routing with adaptive Turn-Back algorithm (A) and with deterministic routing (B). Deterministic down-routing (C). The operation of the FDAR at the switch nodes (D).

part is performed by the IP-CTRL blocks of the input ports. The usage of the FDAR requires that the switches are able to detect the blocking of the packets. Therefore, in the current implementation of the XGFT every IP-CTRL block contains two ten bits wide timer counters for controlling the duration of the arbitration and the pauses of the packet transfers from the input ports to the output ports. If either the arbitration or the transfer takes too much time, the corresponding timer triggers a timeout. The timers are always restarted when a new arbitration starts or the packet transfer must be paused. Two separate timers are needed because the arbitration of the next packet can be performed while the transfer of the previous packet is still going on. The IP-CTRL blocks send information about the blocking backwards to the previous switch nodes along the routing path by assigning value '1' to their output Blocked-signal. As a consequence, only the last IP-CTRL block along a routing path, whose input Blocked-signal has value '0', is able to remove the packet. It removes the packet if either one of its timers has launched a timeout or its input Remove-signal has value '1'. The on-line FDAR could be implemented by probing the XGFT in "00"-mode and by using the Remove-signal for removing the packets from the network, even though this would lead to an uneconomic usage of the routing resources. The IP-CTRL blocks also perform congestion management functions in the normal mode of operation. The packets transmitted in the normal "11" and "10" modes have a smaller timeout level (TIMEOUT_NORMAL) than the packets transmitted in the diagnosis "01" and "00" modes (TIMEOUT_DIAG) so that they do not cause even worse congestion in the network by blocking other packets when they become blocked.

The switch nodes of the presented XGFT are also reconfigurable. The IP-CTRL blocks have two dedicated port mask registers which contain one bit for every output port for storing information about faulty output connections. One of the mask registers is used by the input arbiter of the IP-CTRL, and the other by the output controller of the IP-CTRL which controls the transfer of the packets from the input ports to the output ports. The port mask registers are configured only when the blocked packet is a repair-packet. Packets can be transferred from the input ports only to the output ports which are not masked by either of the port mask registers. If suitable unmasked output ports are not available, the packet must be removed.

4.2. The Fault Model

It is usually assumed that the channels and routers have mechanisms for detecting errors and that the faulty parts of the networks can be reconfigured on-line in such a way that they cannot disrupt the operation of the whole network. Routing protocols like MB-m and FCR usually just make use of the information about the existence of the faulty components [10, 11, 12]. The communication channels of the presented XGFT can also optionally be tested with built-in self-tests (BIST) [17]. If the BIST of some channel fails, the corresponding input and output ports are not enabled and are not used at all. Otherwise, the ports are enabled and start normal operation after the BIST. On the other hand, the whole network could also be considered a core under test (CUT) and the diagnosis could be considered a high abstraction level functional test. In this case the network must implement methods like the FDAR for recovering from the faults as they occur. This also requires that the faults do not cause arbitrary erratic behavior in the network and that the packets are just blocked or routed to wrong destinations when the faults occur. Below is a list of some less serious repairable faults which do not necessarily kill the whole system although they may block the packets. This list also defines a high abstraction level fault model for testing and diagnosing the networks at the functional (or behavioral) level instead of the traditional gate level.

1. Packets may be routed to wrong destinations or be blocked if routing does not work properly. The misrouting can be noticed, because the destinations can send acknowledgements to packet sources with probepackets. The routing path must be marked unusable

208

A

and the faulty switch node can be reconfigured later, if it can be found.

2. The arbitration may last too long if there are faults in the input arbiter, in the output arbiters, or in both. This fault can be repaired by masking the requested output port. If the input arbiter itself is malfunctioning, it eventually masks itself by masking all of the output ports.

3. The transfer of the packets may be blocked, if the channel between the output port of the switch and the input port of the next switch is faulty. This fault can be repaired by masking the output port connected to the faulty link.

4. If bit errors are detected in packet headers, the last IP-CTRL blocks along the routing paths simply remove the packets instead of forwarding them.

Fig. 5. Performances of deterministic routing (det(fxd), det(rnd)) and adaptive TB routing (adapt) with random traffic (A) and cluster traffic patterns (B). Axes in both panels: Average Lat. [clk cycle] and Throughput [%].
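The repair actions named in items 2 and 3 of the list above all reduce to setting bits in a port mask. The following Python sketch illustrates that idea under stated assumptions: the class `PortMask`, the function `repair`, the fault names, and the convention that a set bit means "masked" are hypothetical and are not taken from the presented IP-CTRL implementation.

```python
from enum import Enum, auto
from typing import List, Optional


class Fault(Enum):
    """Hypothetical names for repairable faults of the fault model."""
    ARBITRATION_TIMEOUT = auto()   # item 2: arbitration lasts too long
    FAULTY_LINK = auto()           # item 3: channel to the next switch is faulty
    INPUT_ARBITER_FAULT = auto()   # item 2: the input arbiter itself is faulty


class PortMask:
    """One bit per output port; a set bit marks the port unusable (assumed convention)."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        self.bits = 0

    def mask(self, port: int) -> None:
        """Mask a single output port."""
        if not 0 <= port < self.num_ports:
            raise ValueError(f"no such output port: {port}")
        self.bits |= 1 << port

    def mask_all(self) -> None:
        """Mask every output port."""
        self.bits = (1 << self.num_ports) - 1

    def usable_ports(self) -> List[int]:
        """Return the indices of the ports that are not masked."""
        return [p for p in range(self.num_ports) if not (self.bits >> p) & 1]


def repair(mask: PortMask, fault: Fault, port: Optional[int] = None) -> None:
    """Apply the repair action that the fault list associates with a fault."""
    if fault is Fault.INPUT_ARBITER_FAULT:
        # A faulty input arbiter masks itself by masking all output ports.
        mask.mask_all()
    else:
        # Timeouts and faulty links are repaired by masking one output port.
        if port is None:
            raise ValueError("this fault needs the affected output port")
        mask.mask(port)


mask = PortMask(num_ports=4)
repair(mask, Fault.FAULTY_LINK, port=2)
print(mask.usable_ports())  # [0, 1, 3]
repair(mask, Fault.INPUT_ARBITER_FAULT)
print(mask.usable_ports())  # []
```

Misrouting (item 1) and header bit errors (item 4) are left out of the sketch, because the text repairs them by marking routing paths unusable and by dropping packets rather than by masking ports.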

The use of functional-level testing and diagnosis does not exclude the use of other testing methods; these methods can be used side by side with the FDAR when necessary. For example, it is possible to use at-speed BIST, which is also able to find dynamic faults, for testing the communication channels before the diagnosis of the whole XGFT. This produces a more reliably operating network, because the increased test coverage allows a larger part of the faults to be repaired during the system diagnosis. The logic of the switch nodes could also be partially tested with scan-based test methods [17], while the FDAR would be used only for diagnosing and repairing the XGFT so that the system operates as normally as possible unless it is very badly defective.

5. SYNTHESIS AND SIMULATION RESULTS

The XGFT(3,4,4,4,3,2,1) was simulated in order to find out how much worse the deterministic routing performs than the adaptive TB routing. The performance simulations were performed with uniformly distributed random traffic and cluster traffic patterns. The XGFT(3,4,4,4,3,2,1) was divided into four clusters of 16 leaves. In the simulations with cluster traffic, 80% of the traffic was transferred within the clusters and 20% between the clusters. In the simulation results, the throughput (Throughput) is the ratio of the average received throughput per network interface to the maximum possible throughput, in percent. The latency (Average Lat.) is the average header latency, i.e. the time in clock cycles between the transmission and the reception of the packet header. The simulations were performed with uniformly distributed packet lengths of 8 to 64 words.

In deterministic routing the routing path is determined before the transmission of the packet. The switch nodes are not able to change the paths even if they become congested and alternative paths are available. Because of this, the network resources cannot be optimally allocated to the packets, which reduces the performance of the network. The performance can be improved if the sources generate routing paths randomly for every packet instead of transmitting the packets along the same fixed path associated with each destination. As Fig. 5A and 5B show, deterministic routing achieves clearly higher performance with randomly chosen routing paths (det(rnd)) than with fixed routing paths (det(fxd)) with both uniformly distributed and cluster traffic. The performance of the network can be further improved by using adaptive routing like the TB, which is able to take the state of the network into account when it makes routing decisions. As Fig. 5A and 5B show, the TB produces the highest performance (adapt) with both traffic patterns.

The simulation results shown in Fig. 5B also suggest that processes should be mapped so as to produce as much local cluster traffic as possible. This should be done particularly if, for example, there are faulty topmost roots in the XGFT(3,4,4,4,3,2,1). Additionally, faulty switches in one of the sub-XGFTs would not affect the routing within the other sub-XGFTs or degrade their performance if the communicating processes were mapped to the same clusters.
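The traffic patterns used in the simulations of this section can be sketched as follows. This is a minimal illustration and not the simulator used for Fig. 5: the function names, the assumption that each cluster occupies 16 consecutive leaf indices, and the exclusion of the source from its own destinations are assumptions. The 64 leaves (four clusters of 16), the 80%/20% split, and the packet lengths of 8 to 64 words come from the text.

```python
import random
from typing import Tuple

NUM_LEAVES = 64        # four clusters of 16 leaves (from the text)
CLUSTER_SIZE = 16
LOCAL_FRACTION = 0.8   # 80% of the cluster traffic stays within the cluster


def cluster_of(leaf: int) -> int:
    """Assumed mapping: cluster k owns leaves 16*k .. 16*k + 15."""
    return leaf // CLUSTER_SIZE


def pick_destination(src: int, rng: random.Random, cluster_traffic: bool) -> int:
    """Pick a destination leaf different from src."""
    if cluster_traffic and rng.random() < LOCAL_FRACTION:
        # Local packet: destination in the same cluster as the source.
        base = cluster_of(src) * CLUSTER_SIZE
        candidates = [d for d in range(base, base + CLUSTER_SIZE) if d != src]
    elif cluster_traffic:
        # Remote packet: destination in one of the other clusters.
        candidates = [d for d in range(NUM_LEAVES)
                      if cluster_of(d) != cluster_of(src)]
    else:
        # Uniformly distributed random traffic over all other leaves.
        candidates = [d for d in range(NUM_LEAVES) if d != src]
    return rng.choice(candidates)


def make_packet(src: int, rng: random.Random,
                cluster_traffic: bool) -> Tuple[int, int, int]:
    """Return (source, destination, length in words)."""
    length = rng.randint(8, 64)  # uniform, both bounds inclusive
    return src, pick_destination(src, rng, cluster_traffic), length


rng = random.Random(1)
packets = [make_packet(rng.randrange(NUM_LEAVES), rng, True)
           for _ in range(10000)]
local = sum(cluster_of(s) == cluster_of(d) for s, d, _ in packets)
print(f"share of local packets: {local / len(packets):.2f}")
```

With enough packets the printed share should settle close to 0.80; calling `make_packet` with `cluster_traffic=False` gives the uniformly distributed random pattern instead.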

Four different XGFT configurations were also synthesized (130 nm CMOS, 300 MHz) with management interfaces in order to estimate the costs of the FDAR and the VRCs. As the results in Table 1 show, the use of the VRC does not considerably increase the block areas when the FDAR and BIST are not used. As the results in Table 2 show, the insertion of the FDAR into the network implemented with the VRC increases the block areas by approximately 7% (12.209 mm² versus 11.443 mm² for the whole XGFT), and the insertion of both the FDAR and the BIST increases them by approximately 31% (15.033 mm² versus 11.443 mm²). However, the increased block areas will not significantly reduce the chip yields, because the NOCs will consume a relatively small proportion of the total area of the MPSoCs.

Table 1. The synthesized block areas without FDAR and BIST, and without and with VRC.

Block                    Without VRC    With VRC
1x6-switch (stage 4)     0.358 mm²      0.375 mm²
1x4-switch (stage 3)     0.248 mm²      0.258 mm²
2x4-switch (stage 2)     0.295 mm²      0.308 mm²
3x4-switch (stage 1)     0.349 mm²      0.364 mm²
XGFT(3,4,4,4,3,2,1)      10.970 mm²     11.443 mm²

Table 2. The synthesized block areas with VRC and FDAR, and without and with BIST.

Block                    Without BIST   With BIST
1x6-switch (stage 4)     0.403 mm²      0.491 mm²
1x4-switch (stage 3)     0.273 mm²      0.337 mm²
2x4-switch (stage 2)     0.330 mm²      0.406 mm²
3x4-switch (stage 1)     0.388 mm²      0.478 mm²
XGFT(3,4,4,4,3,2,1)      12.209 mm²     15.033 mm²

6. CONCLUSION

This paper presents a fault-tolerant XGFT in which the new FDAR system is used for repairing static faults. The presented XGFT also utilizes parity bit checks for recovering from dynamic and transient faults. The logic area of the XGFTs implemented with the FDAR is not considerably higher than that of the XGFTs implemented without it, but the BIST of the communication links increases the area consumption clearly more. This paper also presents simulation results which show that the performance of a faulty XGFT is not necessarily much lower than that of a faultless XGFT if the FDAR is used for diagnosing and repairing the faults. In particular, if the processes are mapped to the processors so that they produce a lot of local cluster traffic, the decrease in performance can be quite negligible.

7. REFERENCES

[1] J. Nurmi, H. Tenhunen, J. Isoaho, and A. Jantsch, Eds., Interconnect-Centric Design for Advanced SOC and NOC. Dordrecht, The Netherlands: Kluwer Academic Publishers, 2004, ch. 1.

[2] J.-P. Soininen, A. Jantsch, M. Forsell, A. Pelkonen, J. Kreku, and S. Kumar, “Extending platform-based design to on chip systems,” in Proceedings of the 16th International Conference on VLSI Design, New Delhi, India, Jan. 2003, pp. 401–408.

[3] C. Rowen, Engineering the Complex SOC: Fast, Flexible Design with Configurable Processors. Upper Saddle River, New Jersey, USA: Prentice Hall, 2004.

[4] L. Benini and G. D. Micheli, “Networks on chips: A new SoC paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.

[5] J. Henkel, W. Wolf, and S. Chakradhar, “On-chip networks: A scalable, communication-centric embedded system design paradigm,” in Proceedings of the 17th International Conference on VLSI Design, Mumbai, India, Jan. 2004, pp. 845–851.

[6] A. Krstic, L. Wei-Cheng, C. Kwang-Ting, L. Chen, and S. Dey, “Embedded software-based self-test for programmable core based designs,” IEEE Des. Test. Comput., vol. 19, no. 4, pp. 18–27, July–Aug. 2002.

[7] H. Jing-Reng, M. Iyer, and C. Kwang-Ting, “A self-test methodology for IP cores in bus-based programmable SoCs,” in Proceedings of VLSI Test Symposium (VTS 2001), Marina Del Rey, California, USA, April–May 2001, pp. 198–203.

[8] K. Jayaraman, V. Vedula, and J. Abraham, “Native mode functional self-test generation for systems-on-chip,” in Proceedings of the International Symposium on Quality Electronics Design, San Jose, California, USA, Mar. 18–21, 2002, pp. 280–285.

[9] A. Jantsch and H. Tenhunen, Eds., Networks on Chip. Dordrecht, The Netherlands: Kluwer Academic Publishers, 2003, ch. 7.

[10] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, 2nd ed. USA: Morgan Kaufmann, 2003.

[11] P. Gaughan and S. Yalamanchili, “A family of fault-tolerant routing protocols for direct multiprocessor networks,” IEEE Trans. Parallel Distrib. Syst., vol. 6, no. 5, pp. 482–497, May 1995.

[12] J. Kim, L. Ziqiang, and A. Chien, “Compressionless routing: A framework for adaptive and fault-tolerant routing,” IEEE Trans. Parallel Distrib. Syst., vol. 8, no. 3, pp. 229–244, Mar. 1997.

[13] S. Ohring, M. Ibel, S. Das, and M. Kumar, “On generalized fat trees,” in Proceedings of 9th International Parallel Processing Symposium, Santa Barbara, CA, USA, Apr. 1995, pp. 37–44.

[14] H. Kariniemi and J. Nurmi, “Performance evaluation and implementation of two adaptive routing algorithms for XGFT networks,” in Proceedings of the 7th IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems, Stara Lesna, Slovakia, Apr. 2004, pp. 13–20.

[15] J. Hammond and P. O’Reilly, Performance Analysis of Local Computer Networks. USA: Addison-Wesley, 1986.

[16] N. McKeown, “The iSLIP scheduling algorithm for input-queued switches,” IEEE/ACM Trans. Networking, vol. 7, no. 2, pp. 188–201, Apr. 1999.

[17] A. Crouch, Design-for-Test for Digital IC’s and Embedded Core Systems. Upper Saddle River, New Jersey, USA: Prentice Hall, 1999.