A FLEXIBLE CIRCUIT-SWITCHED NOC FOR FPGA-BASED SYSTEMS

Clint Hilton ∗

Brent Nelson

Electronic Technology Division, Rincon Research Corporation
101 N. Wilmot Rd. Suite 101, Tucson, AZ 85711, USA
email: [email protected]

Department of Electrical and Computer Engineering, Brigham Young University
Provo, UT 84602, USA
email: [email protected]

ABSTRACT

Increases in chip density due to Moore’s Law allow for the implementation of ever larger and more complex systems on a single chip. The communication mechanisms employed in such SOC’s are an important contributor to their overall performance. Networks on chip promise to overcome the scalability problems found in bus-based interconnect. Most work to this point has focused on packet-switched NOC’s. Circuit-switched networks are a lightweight alternative that promise high communication rates and predictable communication latencies. This paper describes PNoC, a very flexible circuit-switched NOC suitable for use in FPGA-based systems. Implementation results on a Virtex-II Pro device are given using an image binarization demonstration which resulted in a 2-23× speedup with a 29% area overhead compared to a shared bus implementation.

1. INTRODUCTION

Increases in chip density due to Moore’s Law allow for the implementation of ever larger systems on a single chip. Known as systems on chip, or SOC’s, these systems usually contain a mixture of CPU’s, memories, and custom hardware modules. Such systems on chip can also be implemented on FPGA substrates, something we will refer to as Programmable Systems-On-Chip, or PSOC’s, in this paper.

The inter-module communication mechanisms employed on SOC’s and PSOC’s have recently received significant attention for at least two reasons. First, traditional bus-based communication mechanisms do not scale well with increasing system complexity and become a bottleneck as system complexities continue to increase. Second, design and verification times for complex systems continue to grow – the desire for efficiencies in design and verification methodologies argues for standardized communication mechanisms instead of ad hoc direct module interconnections.

∗ The primary author conducted the research for this paper while at Brigham Young University.


Shared buses such as ARM’s AMBA bus [1] and IBM’s CoreConnect [2] are commonly used communication mechanisms in SOC’s and PSOC’s. They support a modular design approach that uses standard interfaces and allows for IP reuse [3], but the bus is often the performance bottleneck in a large system. Both Xilinx and Altera support a hybrid bus/direct-interconnect architecture which allows for direct module-to-module connections in addition to the bus interconnect. Hybrid approaches scale better than purely bus-based schemes, but complicate the design process since they reduce the modularity of the system and require custom hardware design for the module-to-module connections. Another alternative would involve the use of multiple buses or bus segments to alleviate the load on the main bus. This would allow for local communication between modules on the same bus segment without causing congestion on the rest of the bus. Again, the disadvantages of this approach are its reduced flexibility and its complication of the design process.

Various network interconnect approaches have been proposed for SOC’s and PSOC’s [4, 5, 6, 7]. Networks scale better and promise higher communication bandwidth than buses. Like buses, they allow the re-use of standard interface modules for connecting circuit nodes to the network. Networks-on-chip (NOC’s) can be divided into two categories – packet-switched and circuit-switched. In a packet-switched approach the data is broken into packets, each of which contains routing information. These packets are injected into the network where they are independently routed to the desired destination. In circuit switching, a dedicated connection path (a virtual circuit) between two nodes must be established before communication can take place. Once the virtual circuit is established, raw data can be freely transferred between the modules until the virtual circuit is no longer needed, at which time it can be closed.

Packet-switched networks often allow for high aggregate system bandwidth since many packets can be in flight at a given instant. However, they generally require congestion control and packet processing, which includes buffers


to queue up packets awaiting the availability of the routing resources. Circuit-switched networks are connection-based, meaning that once a virtual circuit is established, high-throughput data transfer between the modules can be guaranteed for as long as desired. They require no overhead for packetization, packet header processing, or packet buffering. As a result, the circuitry required for a circuit-switched network is relatively simple and appropriate for use in even modest-sized systems.

Two problems associated with circuit switching have been mentioned as shortcomings. First, setup latency, the time required to build a virtual circuit, must be incurred before any communication between nodes can take place. In the system described here, this setup time is minimal and is similar to that required to route a single packet in other proposed systems. The second problem involves idle time on communication links – this will result when connections have been established but no transfers are taking place. This is not a major concern in our system – opening and closing connections is lightweight enough that there is little motivation for nodes to monopolize communication links by leaving them open for long periods of time.

As an example of a packet-switched NOC, the CLICHE architecture [5] is a fixed 2-D mesh with one routing switch for each compute node. Although the architecture is highly scalable, the authors concede that it is unsuitable for certain “heavy dataflow applications” for performance reasons. As another example, [7] is targeted specifically to FPGA’s. It is also a 2-D torus and performs packet switching using wormhole routing. Of particular interest, it uses partial reconfiguration of the FPGA to support runtime dynamic module replacement. References [8] and [9] both provide strong arguments for circuit switching in NOC-based systems. The architecture proposed in [8] is a time-division-multiplexed central switched network (crossbar) shared by all communicating nodes. SoCBUS [9] is a circuit-switched NOC organized as a fixed 2-D mesh and includes a routing switch for every compute node. Both of these references perform detailed simulations of circuit-switched NOC’s to show their throughput and relative advantages compared to packet-switched NOC’s.

In this work we describe a detailed implementation of a flexible and lightweight circuit-switched NOC for FPGA-based systems and quantify its area and performance benefits. Our proposed network, PNoC, was designed with three goals in mind. First, we wanted it to be a flexible networking approach which would be applicable to a wide variety of system requirements. Flexibility was desired for both the allowable network topologies as well as the communication datapath widths. Second, we wanted a network that simplified system design by providing simple, standard network interfaces and easily understood network protocols. Third, we wanted our network to be lightweight, requiring

few FPGA resources, and thus suitable for both smaller and larger FPGA-based systems.

2. PNOC - A CIRCUIT-SWITCHED NOC FOR USE IN FPGA-BASED SYSTEMS

The PNoC architecture is designed to be extremely flexible. At design time it is possible to easily construct a variety of network architectures, each with its own mix of system routing and computational resources. In addition, the network modules are parameterized for communication path widths, flow control, and timeout handling. At runtime, PNoC’s flexibility supports the dynamic removal and insertion of nodes in the system, if supported by the FPGA fabric. (PNoC provides support for dynamic module replacement via routing table updates; however, the creation of partial reconfiguration bitstreams is outside the scope of PNoC’s functionality.)

Fig. 1. Example PNoC Topology

The proposed network topology consists of a series of subnets, where each contains a router and a collection of network nodes, as shown in Figure 1. The routers perform the circuit switching between the nodes, and each node connects to a single router through a router port interface. A lightweight handshaking mechanism is used to establish dedicated connections between nodes, to exchange data, and to remove connections. To establish a connection, the requesting node (the master - Node A in the figure) asserts its request signal to the router while specifying the desired target node address on its address lines. The router determines which port is associated with the desired target node (the slave - Node B in the figure). In this example, the connection request is forwarded on to a second router since Node B resides in its subnet. The second router then processes the request to identify the port connected to the target node and determines if the target node is available. Once it becomes available, the second router informs the first router, which informs the master via a grant signal, and the connection is established. The master and slave are then free to transfer data as desired.

A write-followed-by-read sequence is shown in Figure 2. The write operation begins when master_tx_valid goes high and master_tx_rnw is held low. Since the data transfer occurs on a dedicated connection path there is no need


for an acknowledge signal. The slave signals in the bottom half of the figure reflect what the slave sees during the transaction. A read request occurs in cycle 2 where master_tx_valid is high and master_tx_rnw goes high. The requested data is returned when master_rx_valid goes high. As long as the master_rx_cts signal (not shown) is low, data transaction requests can occur on consecutive clock cycles. The figure also shows a second read request in cycle 3, illustrating that write and read requests can be pipelined.

Fig. 2. Master Node Write Followed By Read

Network data transactions occur directly between the master and slave nodes. As can be seen, these transactions are similar to those for interfacing to pipelined memories – requests are sent and a set number of cycles later the data is returned. The router is not involved in read and write transactions except that (a) it provides the signal switching fabric so that master and slave can communicate, and (b) it provides one pipeline register in the switching fabric to improve the throughput (clock rate) of the pipelined data transfers.

The master can remove a connection to another block by informing the router, through assertion of the release signal, that it no longer desires the connection. Additionally, a pend signal is supplied to the master to tell it when another node wants access to the slave node. The master may, at its discretion, prematurely close its connection in response. This behavior is not mandated by the network but the router functionality is provided to support it.

The network infrastructure has been designed to support dynamic module replacement via partial reconfiguration. If a node is removed from the system during execution using partial reconfiguration, its local router should be notified via a router update command which will remove that module from the system’s routing tables. When a new module is added to the system, an update command should be sent to its router to add it to the system routing tables.

All routers in the system operate using a common, synchronized clock. Each node, however, may operate at its own clock rate. FIFO’s are used between nodes and routers to provide for buffering of data as well as for crossing between the node’s clock domain and the routers’ clock domain. Status signals from the FIFO’s are provided as a part of the node connection to serve as end-to-end flow control signals.
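As an illustration only, the following Python sketch models the master-side sequence described above: connection request and grant, writes and reads over the dedicated path, and release. The class and method names are our own assumptions and are not part of PNoC (the actual PNoC hardware was designed in JHDL); the multi-router forwarding case is omitted.

```python
# Minimal behavioral sketch (not PNoC source) of the master-side protocol:
# request/grant connection setup, write and read transfers, and release.

class SlaveNode:
    """Toy slave with a small pipelined memory."""
    def __init__(self, address):
        self.address = address
        self.mem = {}

    def write(self, addr, data):       # models slave_rx_valid with rnw low
        self.mem[addr] = data

    def read(self, addr):              # models slave_rx_valid with rnw high
        return self.mem.get(addr)      # data returns when slave_tx_valid goes high


class Router:
    """Maps node addresses to attached nodes and grants one connection at a time."""
    def __init__(self, nodes):
        self.table = {n.address: n for n in nodes}
        self.busy = set()

    def request(self, target_addr):    # master asserts request + address lines
        node = self.table[target_addr]
        if node in self.busy:
            return None                # no grant yet; the master keeps waiting
        self.busy.add(node)
        return node                    # grant: dedicated path is now established

    def release(self, node):           # master asserts release
        self.busy.discard(node)


# Usage: establish a circuit, do a write followed by two reads, then release.
router = Router([SlaveNode(address=0x2)])
slave = router.request(0x2)            # wait for grant
slave.write(0x10, 0xAB)                # master_tx_valid high, master_tx_rnw low
d1 = slave.read(0x10)                  # master_tx_valid high, master_tx_rnw high
d2 = slave.read(0x11)                  # back-to-back requests can be pipelined
router.release(slave)                  # close the virtual circuit
print(d1, d2)
```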

An important design goal was to produce a flexible NOC. The network topology can be altered by how routers are interconnected in the system. Figure 3 shows two different systems that use one and two routers respectively. The interconnectivity between the modules and routers is flexible and is defined by the system designer so that the network can meet specific system needs. Each router is parameterizable in the number of ports it contains and the width of the data and address lines contained in each port connection. Many routers could be used in a single system with a custom topology that best meets the system’s demands. This work presents the building blocks that make this possible rather than addressing the possible topologies and their respective tradeoffs.

Fig. 3. Two Different Network Topologies

2.1. The Routers

The core parts of each router are its routing table and its signal switchbox. The routing table maps network module addresses to ports. The module address serves as the index to the table, and the data stored at that index represents the port(s) that may be used to establish the connection path. PNoC allows the inclusion of multiple nodes with the same network address in a system. The router assumes that all nodes with address k are interchangeable and will use the first available such node to satisfy a connection request. This makes it possible to easily alter the mix of modules in processor-farm kinds of designs without the individual nodes being aware of the exact mix. This capability is exploited in the demonstration system described later in the paper. (A small software sketch of this first-available selection and request queueing appears at the end of Section 2.2 below.)

2.2. The PNoC Module Interface

One of the goals of this work is to facilitate the design of complex systems through modular design using a simple interface to the communication medium. Modules that connect to the network do so via a well-defined port interface which contains multi-bit transmit and receive data/address lines along with handshaking control signals. Figure 4 shows the hardware needed to effectively integrate a module with the network. On the left is the node circuitry itself – on the right is the network interface circuitry which consists of


transmit and receive FIFOs, and a simple FSM to communicate with the router.

Fig. 4. Node Interface Hardware

Depending on the system timing characteristics, certain network modules may require the use of interface FIFOs. These FIFOs are strictly necessary in two cases: (1) the node runs at a clock rate different from that of its subnet router, or (2) the node, when acting as a slave, is unable to keep up with the data transmission rate of potential masters. In case (1) the FIFOs are used to cross between the node’s clock domain and the routers’ clock domain. In case (2) the almost-full status flag on a receiving FIFO may be used by the master node for flow control purposes. The inclusion of FIFOs in the node interface is a parameterizable feature of the node interface design.
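The almost-full flow control of case (2) can be pictured with a small software model. This is our own toy construction, not PNoC source; the FIFO depth, the almost-full threshold, and all names are assumptions made only for illustration.

```python
# Toy model of end-to-end flow control: the slave's receive FIFO exports an
# almost-full flag, and the master treats it as a clear-to-send signal,
# stalling transfers while the flag is set.
from collections import deque

class RxFifo:
    def __init__(self, depth=8, almost_full_at=6):
        self.q = deque()
        self.depth = depth
        self.almost_full_at = almost_full_at

    @property
    def almost_full(self):
        return len(self.q) >= self.almost_full_at

    def push(self, word):
        assert len(self.q) < self.depth, "overflow - master ignored flow control"
        self.q.append(word)

    def pop(self):
        return self.q.popleft() if self.q else None

fifo = RxFifo()
sent = 0
for word in range(20):            # master wants to send 20 words
    while fifo.almost_full:       # cts deasserted: stall until the slave drains
        fifo.pop()                # (slave consumes one word per stalled cycle)
    fifo.push(word)
    sent += 1
print("sent", sent, "words without overflowing the FIFO")
```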

A CPU can be readily connected to the PNoC like any other node. The implementation used in this work is a Xilinx MicroBlaze CPU combined with a custom memory mapped network interface.
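The routing-table behavior of Section 2.1 (first-available selection among interchangeable nodes, in-order queueing of unsatisfied requests, and runtime table updates) can be sketched as follows. The data structure and method names are our own assumptions, intended only to illustrate the behavior described in the text.

```python
# Illustrative sketch (assumptions, not PNoC source) of the routing table:
# several interchangeable nodes may share one network address, the router
# grants the first available one, and requests that cannot be satisfied yet
# are queued in order. update() mimics the router update command used when
# modules are added or removed at runtime.
from collections import deque

class RoutingTable:
    def __init__(self):
        self.ports = {}        # address -> list of ports serving that address
        self.busy = set()      # ports currently part of an open connection
        self.waiting = {}      # address -> FIFO of pending requesters

    def update(self, address, port, add=True):
        entry = self.ports.setdefault(address, [])
        if add:
            entry.append(port)
        else:
            entry.remove(port)

    def request(self, address, requester):
        for port in self.ports.get(address, []):
            if port not in self.busy:          # first available interchangeable node
                self.busy.add(port)
                return port
        self.waiting.setdefault(address, deque()).append(requester)
        return None                            # queued; grant will come later

    def release(self, address, port):
        self.busy.discard(port)
        queue = self.waiting.get(address)
        if queue:
            print(f"grant port {port} to queued requester {queue.popleft()}")
            self.busy.add(port)

# Two interchangeable modules share network address 0x5; three masters compete.
table = RoutingTable()
table.update(0x5, port=3)
table.update(0x5, port=4)
print(table.request(0x5, "block0"))   # -> 3
print(table.request(0x5, "block1"))   # -> 4
print(table.request(0x5, "block2"))   # -> None (queued)
table.release(0x5, 3)                 # block2 is granted port 3
```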

3. IMPLEMENTATION RESULTS

The PNoC building blocks described above have been implemented on a Xilinx Virtex-II Pro FPGA (xc2vp30-7). Design entry was done with JHDL [10]. The resulting NOC building block modules (the router, the node interface, and the CPU-node interface) are parameterized as described in the previous sections. Table 1 gives the area and speed results for a variety of router instances with differing numbers of ports and differing port data widths. In each case, the routing table is implemented using a single BlockRAM which is not reflected in the table. The node interface circuitry (containing the FIFO’s from Figure 4) requires 155 slices and 2 BlockRAM’s. In cases where the FIFO’s are not required the area is reduced to 62 slices. The MicroBlaze CPU node interface circuitry, including the memory mapped network interface module, requires 196 slices and 2 BlockRAM’s.

Table 1. Router Implementation Results

  # Ports   Data Width   Area (Slices)   Speed (MHz)
     2           8              83           160
     4           8             249           151
     8           8            1113           138
     2          32             131           145
     4          32             366           138
     8          32            1305           126

4. TEST APPLICATION

The utility of the PNoC is shown here using a simple image binarization example. This algorithm uses hierarchical thresholding to quantize grayscale image pixels to binary black and white values. The computation involves computing median values at three different levels of hierarchy to be used as quantization threshold values. The algorithm involves the following steps:

1. Compute the median value for the entire image and use that to compute the global threshold value, where global_thresh = median + median/4.

2. For each block of data in the image, determine its darkest pixel value and compare that against the global threshold. If it is lower (darker) than the threshold, then that block presumably contains valid data (it is called a valid block), and is processed further in step 3. Otherwise the entire block of data is set to a white value.

3. Each valid block is divided into smaller windows. Those windows are compared to a block threshold value and those which require additional processing are subjected to window processing in step 4. Otherwise, the window pixels are set to white.


4. Each pixel in a valid window is compared against the computed window threshold and set to black or white accordingly.

5. Steps 2-4 are repeated until every block has been processed and the complete quantized image has been produced and collected back to the CPU. (A small software sketch of these steps is given below.)

This application, targeted to a Virtex-II Pro FPGA (xc2vp30-ff860-7), is illustrated in Figure 5 and consists of four module types. The MicroBlaze processor is the primary controller for the system: it computes the global threshold value for the image and manages the distribution of the image blocks to the block modules. The UART enables the uploading and downloading of the original and final images between the FPGA and a host computer.
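For reference, the thresholding steps above can be written as a small software model. Only the global threshold formula (median + median/4) comes from the paper; the block and window sizes and the block- and window-level threshold rules (plain medians) used here are placeholder assumptions for illustration.

```python
# Software reference model of the hierarchical thresholding steps 1-5 above.
import numpy as np

def binarize(img, block=32, win=8):
    out = np.full(img.shape, 255, dtype=np.uint8)              # start all white
    median = np.median(img)
    global_thresh = median + median / 4                         # step 1 (from the paper)
    for by in range(0, img.shape[0], block):                    # step 2: per block
        for bx in range(0, img.shape[1], block):
            blk = img[by:by + block, bx:bx + block]
            if blk.min() >= global_thresh:                      # no dark pixel: stays white
                continue
            block_thresh = np.median(blk)                       # assumed block threshold
            for wy in range(0, blk.shape[0], win):              # step 3: per window
                for wx in range(0, blk.shape[1], win):
                    w = blk[wy:wy + win, wx:wx + win]
                    if w.min() >= block_thresh:                 # window stays white
                        continue
                    win_thresh = np.median(w)                   # assumed window threshold
                    quant = np.where(w < win_thresh, 0, 255)    # step 4: per pixel
                    out[by + wy:by + wy + w.shape[0],
                        bx + wx:bx + wx + w.shape[1]] = quant
    return out                                                  # step 5: assembled image

# Example: binarize a random 256x256 grayscale image.
img = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
print(binarize(img).shape)
```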



Fig. 5. Binarization Top-Level Modules

The block modules compute a block-level threshold value and, if valid data is detected within the block, divide the block into windows and send these to the window modules. The window modules quantize each pixel of a window based on the window’s threshold value. The main design challenge in this system is in coordinating the transfer of image data between the different nodes. The major communications are between the CPU and block processors, and between the block and window processors. There are a different number of block modules and window modules due to the projected need for each kind of processing in the overall computation.

4.1. Shared Bus Implementation

Two bus-based implementations were completed using Xilinx EDK version 6.3. Each contains a MicroBlaze processor and an OPB (On-chip Peripheral Bus) running at 100 MHz. The first implementation uses simple reads and writes to transfer data on the bus. This has the advantage of allowing other modules bus access during the computation but results in a slower implementation. The second implementation allows the block modules to lock the bus and burst the window data transfers. This results in a faster system but prevents other modules from using the bus during those transfers (essentially during the entire computation).

In a bus-based system there is no built-in way of arbitrating or scheduling access to the window modules without designing custom arbitration into the modules themselves. In these implementations, each window module was designed to satisfy requests from two statically-chosen block modules. Similarly, there is no built-in way of scheduling the bus other than relying on bus arbitration for concurrent requests. Manually time-multiplexing the bus and manually scheduling access to the window modules to improve performance without locking the bus for extended periods of time would greatly complicate the design task and therefore was not done in the second implementation.

4.2. Network Implementation

The PNoC is well-suited to this type of system. Multiple block-to-window module data transfers can occur simultaneously as multiple connections can be active at a given instant. The dynamic routing capability of this network also plays an important role in this system. When a block module requires the services of a window module, that connection can be established with whichever window module becomes available first. No additional hardware is required by the system designer to poll for available window modules — the choice of which window module to use is made by the router. Further, if no window module is available the router will queue up connection requests in order until a window module is available. This allows for considerable flexibility in the system — additional block and window modules can be added to an existing design and recompiled for execution without any changes being required of the block and window modules.

Table 2. Binarization System Comparison

  Parameter      Shared Bus   Locked Bus   PNoC
  Clock Rate     108 MHz      98 MHz       124 MHz
  Total Slices   2,852        2,864        3,685
  Cycle Count    198,113      20,919       9,977

4.3. System Comparisons

Table 2 compares the implementations. About 1,180 slices of the network design were for the 8-port router. Both designs were downloaded to the Xilinx XUP Virtex-II Pro Development Board. Times were recorded in such a way as to remove software overhead on the MicroBlaze from the computation time. Blocks of data were first loaded into the four block modules and then the computation/communication time of the block and window modules was measured using a hardware timer. The experiments were set up to show maximum data transfer capability (all four block modules were competing for the services of the two window modules).

Our original shared bus design used simple bus reads and writes for the data transfer, resulting in a 23× performance advantage for the network version. By allowing the block modules to completely lock the shared bus and perform burst transfers, the performance difference was reduced to 2×. In the second version, the bus was completely locked for entire window computations, preventing any other activity between the CPU and other system modules from occurring. In contrast, the network version allows the CPU to communicate with other system modules (such as the UART) during the computation.


Interestingly, the clock rate of the network implementation was 27% higher. This is consistent with results we have seen for other applications we have completed, and we believe it is due to the shorter wires in PNoC. The network links in the PNoC version achieved close to 100% of the theoretical bandwidth. The computation required the use of two window modules. At 124 MHz each window module could conceivably maintain a 124 MB/sec transfer rate with the block module it is connected to. Each achieved, on average, 119 MB/sec, or 96% of the maximum bandwidth.

5. SUMMARY AND FUTURE WORK

In this paper we have proposed and demonstrated a flexible, lightweight circuit-switched approach to constructing FPGA-based systems. It provides the ease of design (using standard interfaces) of a bus-based approach while providing performance that approaches that of direct interconnect. We believe it is flexible enough for use in general embedded systems and high-performance enough for many high-throughput dataflow applications. This first experiment has quantified the implementation cost of the basic PNoC modules on an FPGA and demonstrated their utility in a real application, at the same time showing the ease of design using PNoC as well as its potential performance.

A number of directions for continued work remain. First, we want to explore the use of multiple routers and subnets in a system. We want to understand the kinds of topologies useful for various patterns of computing and communication. Also we want to investigate the applicable solution space for circuit-switched systems such as PNoC. Unfortunately, at present there are no readily accessible packet-switched systems with which we can make meaningful comparisons. As they become available we would like to perform detailed analysis to understand the tradeoffs. Where are circuit-switched networks a better choice than packet-switched networks and buses? How do they compare for power consumption, clock rate, system throughput, and ease of design and debug?

Finally, an important goal in the creation of PNoC was to support dynamic module replacement via partial reconfiguration. As mentioned, PNoC provides support for adding modules to and deleting modules from a system. We desire to investigate the use of this capability in real applications where requirements change over time and necessitate changing the mix of modules in an embedded computing system.

6. REFERENCES

[1] ARM, “AMBA Specification,” Tech. Rep., Revision 2.0, 1999.

[2] IBM Corporation, “CoreConnect Bus Architecture,” Tech. Rep., 1999.

[3] E. Salminen, V. Lahtinen, K. Kuusilinna, and T. Hamalainen, “Overview of bus-based system-on-chip interconnections,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS 02), 26-29 May 2002, pp. II-372 - II-375, vol. 2.

[4] W. J. Dally and B. Towles, “Route packets, not wires: On-chip interconnection networks,” in Proceedings of the Design Automation Conference (DAC 01), 18-22 June 2001, pp. 684-689.

[5] S. Kumar and A. Jantsch, “A network on chip architecture and design methodology,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI 02), 25-26 April 2002, pp. 105-112.

[6] C. Grecu, P. P. Pande, A. Ivanov, and R. Saleh, “A scalable communication-centric SoC interconnect architecture,” in Proceedings of the 5th International Symposium on Quality Electronic Design, 2004, pp. 343-348.

[7] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, and R. Lauwereins, “Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs,” in Proceedings of the 12th International Conference on Field-Programmable Logic and Applications (FPL 02), September 2002, pp. 795-805.

[8] J. Liu, L.-R. Zheng, and H. Tenhunen, “A circuit-switched network architecture for network-on-chip,” in Proceedings of the International Symposium on System-on-Chip, September 2004, pp. 55-58.

[9] D. Wiklund and D. Liu, “SoCBUS: Switched network on chip for hard real time embedded systems,” in Proceedings of the International Parallel and Distributed Processing Symposium, April 2003.

[10] B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, and M. Rytting, “A CAD suite for high-performance FPGA design,” in Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, K. L. Pocek and J. M. Arnold, Eds. Napa, CA: IEEE Computer Society, April 1999.
