
A Buffer-space Allocation Approach for Application-specific Network-on-Chip

M. Bakhouya, A. Chariete, J. Gaber, M. Wack
Université de Technologie de Belfort-Montbéliard, 90010 Belfort, France
{bakhouya,chariete,gaber,wack}@utbm.fr

Abstract: Rapid advances in technology and design tools have enabled engineers to design systems-on-chip containing a large number of cores. These systems have limited resources and should be implemented with very little silicon area overhead. Several studies have demonstrated that the buffers inside the switches of the on-chip interconnect take up a significant portion of the system's silicon area, which can affect performance and energy consumption. Therefore, their size should be carefully customized to match the communication patterns of a target application. In this paper, a compartmental Fluid-flow based modeling approach is presented to allocate the required resources to each buffer based on the application traffic pattern. Simulations are conducted and preliminary results are reported to show the efficiency of the Fluid-flow based modeling method for buffer space allocation.

I. INTRODUCTION

The expansion of integration technology today provides opportunities for designing Systems-on-Chip (SoCs) with a large number of cores (e.g., processing elements) connected by an on-chip interconnect architecture. These systems have emerged as a key technology behind most embedded and smart miniaturized systems, especially for telecommunication and multimedia applications, providing high flexibility and better performance. A key element in the performance and energy consumption of SoCs is the on-chip interconnect, which links all devices in the chip infrastructure. The on-chip communication architectures used in current SoCs to integrate components or cores, such as IBM CoreConnect [1] and Micronetworks [2], are based on bus schemes. With the increasing complexity of SoCs and their communication requirements, Network-on-Chip (NoC) has emerged as a solution to the limited scalability of shared-bus schemes [3], [4], [5], [6]. Examples of NoC architectures are 2D Mesh, Torus, Spidergon, Octagon, and WK [7], [5]. NoC architectures consist of a number of interconnected IP cores (e.g., CPU, DSP, memories) that communicate via an interconnection network (i.e., the on-chip interconnect). They are characterized by different trade-offs with regard to latency, throughput, communication load, energy consumption, and silicon area requirements. NoC systems have limited resources compared to parallel and distributed systems, and should be implemented with very little area overhead. Furthermore, their increasing complexity makes their design, prior to deployment, extremely challenging. The design of flexible, scalable, and reliable on-chip communication architectures that meet the constraints and requirements of today's SoC applications is required.

Recent studies have demonstrated that buffers take up a significant portion of the silicon area of the NoC [8], [9], [10], [11], [12]. Their size should be carefully customized for each link/channel input to match the communication pattern of a specific application. In other words, oversizing buffers can affect the performance, the energy consumption, and the silicon area requirements. It is, therefore, useful to perform a traffic analysis in the early design stages so that the designer can select the appropriate buffer size for each channel.

Performance evaluation and analysis of on-chip interconnect architectures are based on either simulation or analytical models [7], [13], [14], [15], [16]. Generally, simulation is time-consuming and provides little insight into how different design parameters affect the actual NoC performance. Analytical models, however, allow a fast evaluation of the performance metrics of large systems early in the design process. Several works, such as the one presented in [17], have demonstrated that there is a crucial need for system design tools and methodologies to analytically evaluate NoC architectures.

In this paper, a compartmental Fluid-flow based method, inspired by the work presented by Guffens et al. in [18], is introduced for the analytical evaluation of on-chip interconnect architectures. This theory has been used in different domains, mainly in biology and the physical sciences [19], [20]. It may be used to model systems that are governed by a law of mass conservation and whose state variables are constrained to remain non-negative [18], [21]. This paper focuses mainly on the practical use of this method for customizing buffer sizes given a candidate on-chip interconnect architecture and an application workload. In other words, the main objective of this methodology is to allow designers to customize the buffer space to best suit the needs of a particular application workload.
The rest of this paper is organized as follows. In Section II, methods proposed for buffer space allocation are presented. Section III describes the Fluid-flow modeling approach for on-chip interconnect architectures. Section IV presents a case study considering the 2D Mesh on-chip interconnect, together with preliminary performance evaluation results showing the effectiveness of this modeling approach. Conclusions and future work are given in Section V.

II. RELATED WORK

Analytical methods for buffer space allocation in NoC router design can be classified into four main categories: deterministic approaches, stochastic approaches, physics-based approaches, and system theory-based approaches.

In the first category, approaches are mainly based on graph theory. For example, in [22], a model using a cyclo-static dataflow graph was used for buffer dimensioning in NoC applications. To show the effectiveness of this model, the analytical results were compared with those obtained from simulation. However, deterministic approaches assume that the designer has deep knowledge of the communication pattern among cores and switches.

Most work to date using probabilistic approaches is based on queuing theory. In [10], an algorithm was proposed to automatically assign the buffer depth of each input channel to match the communication pattern of a target application under a given buffering space budget. Most queuing approaches model incoming and outgoing traffic with probability distributions (e.g., Poisson traffic). However, NoC applications exhibit traffic patterns that differ considerably from the Poisson model assumed in queuing theory [23], [4]. More precisely, the Poisson model fails to capture some important network characteristics such as self-similarity or long-range dependence [24].

In [25], the authors suggested statistical physics and information theory to study buffer behavior. Statistical physics can model the interactions among various components while taking long-term memory effects into consideration. The main concept in this model is that packets in the network move from one node to another in a manner similar to particles moving in a Bose gas and migrating between various energy levels as a result of temperature variations. More precisely, the approach involves a virtual random growing network, which describes the evolution of the NoC buffers as a function of the packet injection rate.
Performance analysis of the proposed model predicts that the buffer occupancy follows a power law distribution.

The fourth category uses system theory, which has been successfully applied to the design of electronic circuits. In particular, Network Calculus [26], [27] is inspired by this theory and is used for modeling and evaluating the buffer size of each switch. The attractive feature of Network Calculus is its ability to capture all traffic patterns through the use of bounds. Based on the shapes of the traffic flows, designers are able to capture some dynamic features of the network. For example, in [28], Network Calculus was used to analyze and evaluate performance metrics of on-chip interconnects. It was demonstrated that Network Calculus theory can be used to find the buffer size of each channel that best suits the traffic characteristics. However, Network Calculus is a worst-case analysis and provides only upper bounds on buffer depth and latency.

Another category for analyzing and studying the behavior of dynamic systems is compartmental Fluid-flow theory, which takes into account the stochastic nature of the input traffic flows. In the NoC field, a state space model, based on compartmental Fluid-flow theory, for NoCs with a state observer controller is proposed in [29]. The authors focused mainly on controlling the input and output flow rates and monitoring the intermediate flow rates from the on-chip routers in order to alleviate congestion and stabilize the network. However, the processing service at on-chip routers was not considered in this model. The work presented in this paper is inspired by the modeling approach proposed in [18], [30], [31] for macro computer network analysis and evaluation, and also considers the processing rate at each router. This concept of processing rate is also used in the Network Calculus method, in which the maximum data flow that can be sent by each switch is constrained by the arrival flow and its average processing rate. We mainly show how compartmental Fluid-flow theory can be used for studying and analyzing the non-linear behavior of NoC applications. In particular, the Fluid-flow model is used as a design space exploration methodology for customizing or tuning the buffer sizes of on-chip interconnect architectures given a particular application traffic pattern.

III. FLUID-FLOW MODEL

A NoC system can be viewed as a distributed system composed of autonomous nodes that communicate by exchanging messages through an on-chip interconnect. As shown in Figure 1, there are three important elements in a NoC: cores, routers (or switches), and bidirectional links. Each core can be either a source or a sink, in which packets are constructed or consumed. A switch contains a server that processes incoming packets and directs them to the neighboring switches or local cores. Because of the limited processing capacity of the router or switch, incoming packets must be stored in local buffers before their transmission. Each ingress port in the switch has a buffer for temporary storage. When a packet arrives at a switch, it must go into the buffer, which corresponds to a Drop-tail queue with a FIFO queue management mechanism.
Buffers are required to absorb differences in switch speeds and the burstiness of the traffic exchanged between the cores.
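The Drop-tail FIFO behavior of an ingress buffer described above can be sketched as follows. This is only an illustrative sketch, not the switch implementation used in this paper: the class name `DropTailBuffer`, its attributes, and the capacity value are hypothetical. The `peak` counter records the largest occupancy observed, which is the quantity a buffer-sizing study is interested in.

```python
from collections import deque


class DropTailBuffer:
    """FIFO buffer that drops arrivals once `capacity` packets are stored."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0   # number of arrivals rejected because the buffer was full
        self.peak = 0      # largest occupancy seen so far

    def enqueue(self, packet):
        """Store the packet at the tail; return False if it had to be dropped."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(packet)
        self.peak = max(self.peak, len(self.queue))
        return True

    def dequeue(self):
        """Remove and return the oldest packet, or None if the buffer is empty."""
        return self.queue.popleft() if self.queue else None


# Hypothetical usage: a two-slot buffer receiving three packets back to back.
buf = DropTailBuffer(capacity=2)
results = [buf.enqueue(p) for p in ("p1", "p2", "p3")]
print(results, buf.dropped, buf.peak)
```

An undersized buffer shows up as a non-zero `dropped` count, while an oversized one shows up as a `peak` well below `capacity`.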

Fig. 1. Switch structure and data flows exchanged between selected cores

The communication between two cores is characterized as flows that are represented by sequences of hops. For example, in Figure 1, flows $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$ represent communication flows between source cores ($c_2$, $c_5$, $c_6$, $c_9$) and cores ($c_1$, $c_3$, $c_7$, $c_8$) considered as sinks. The traffic model of a switch node $s_i$ is shown in Figure 2. A switch $s_i$ has input flows $\lambda_{ki}$ from $q$ local cores $c_k$, $1 \le k \le q$ and $k \neq i$, and input flows $\alpha_{ji}$ from neighboring switches $s_j$ (i.e., the outputs of $s_j$), $1 \le j \le n$, $j \neq i$. The output flows of $s_i$ are $e_{ik}$ to $p$ local cores $c_k$, $1 \le k \le p$ and $k \neq i$, and $\alpha_{i\ell}$ to neighboring switches $s_\ell$, $1 \le \ell \le m$ and $\ell \neq i$. The compartmental Fluid-flow model can then be expressed as follows:

$$\dot{x}_i = \sum_{k \neq i}^{q} \lambda_{ki} + \sum_{j \neq i}^{n} \alpha_{ji}(x) - \sum_{k \neq i}^{p} e_{ik}(x) - \sum_{\ell \neq i}^{m} \alpha_{i\ell}(x) \tag{1}$$

where $x_1(t), x_2(t), \ldots, x_{ns}(t)$, called the state vector of the system, represent the total number of packets waiting or under processing at the switches. This means that the rate at which packets accumulate in the input buffers of a switch or core is the difference between the total input flows (the first and second terms of Eq. 1) and the total output flows (the third and fourth terms of Eq. 1). To express the output flows, the concept of processing rate function, proposed in [18], can be used, and Eq. 1 can then be rewritten as follows:

$$\dot{x}_i = \sum_{k \neq i}^{q} \lambda_{ki} + \sum_{j \neq i}^{n} a_{ji}\, r_j(x) - \sum_{k \neq i}^{p} a_{ik}\, r_i(x) - \sum_{\ell \neq i}^{m} a_{i\ell}\, r_i(x) \tag{2}$$

where $r_i(x)$ are processing rate functions, which can be expressed by $r_i(x) = \mu_i \frac{x_i}{\rho + x_i}$ as an explicit factorization of $x_i$, with $\rho > 0$, and $\mu_i$ is the service rate of the $i$th switch, assumed to be lower than the maximal transmission capacity $R_i$ (bandwidth) of the outgoing links. These functions should be bounded, continuous, and differentiable, with $r_i(0) = 0$ and $0 \le r_i(x) \le \mu_i$, $\forall x_i > 0$. The parameters $a_{ik}$, $a_{i\ell}$, and $a_{ji}$ are positive values that represent the fraction of packets that are submitted on the links $i \to k$, $i \to \ell$, and $j \to i$ respectively, where $\sum_{k \neq i}^{p} a_{ik} + \sum_{\ell \neq i}^{m} a_{i\ell} = 1$.

Fig. 2. Input/output traffic of the switch $s_i$

Using matrix expression, Eq. 2 can be rewritten in a compact form as follows:

$$\dot{x} = G(x)\,x + \vartheta \tag{3}$$

where $x$ is the state vector with elements $x_i$ ($1 \le i \le ns$), and $\vartheta$ is the input vector with the elements $\lambda_{ki}$, $1 \le k \le q$. $G(x)$ is a matrix with the following properties [21]:
1) $G(x)$ is a Metzler matrix with non-negative off-diagonal entries, which are either 0 or $g_{ij}(x) = \frac{a_{ji}\,\mu_j}{\rho + x_j}$, $i \neq j$;
2) the diagonal elements are non-positive, $g_{ii}(x) = -\sum_{k \neq i}^{p} \frac{a_{ik}\,\mu_i}{\rho + x_i} - \sum_{\ell \neq i}^{m} \frac{a_{i\ell}\,\mu_i}{\rho + x_i}$;
3) $|g_{ii}(x)| \ge \sum_{j \neq i}^{m} g_{ji}(x)$, i.e., the matrix is diagonally dominant;
4) $G(x)$ should be non-singular and stable.
This last property can be verified by checking whether the network is fully outflow connected, i.e., for each node $i$, there exists a path from $i$ to another node $k$ from which there is an outflow. For example, the network depicted in Figure 1 is fully outflow connected because from any source node there is a path to another node with an outflow (sink).
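The compact form of Eq. 3 can be explored numerically with a short sketch such as the one below. It is not the authors' tool: the three-switch chain, the values of `MU`, `RHO`, and `THETA`, and all function names are assumptions chosen for illustration. The off-diagonal entries are taken as $a_{ji}\mu_j/(\rho + x_j)$, which is what factoring $x_j$ out of the term $a_{ji} r_j(x)$ of Eq. 2 gives, and the state is integrated with a plain forward Euler scheme.

```python
def processing_rate(x_i, mu_i, rho):
    """r_i(x) = mu_i * x_i / (rho + x_i): zero when empty, below mu_i otherwise."""
    return mu_i * x_i / (rho + x_i)


def build_G(x, a, mu, rho):
    """Build G(x) so that row i of G(x) x + theta reproduces Eq. 2 for switch i.

    a[j][i] is the fraction of packets served by switch j that go to switch i;
    the remaining fraction 1 - sum(a[j]) leaves the network at a local core.
    """
    n = len(x)
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        per_packet = mu[j] / (rho + x[j])   # r_j(x) / x_j
        G[j][j] = -per_packet               # output fractions of switch j sum to 1
        for i in range(n):
            if i != j:
                G[i][j] = a[j][i] * per_packet
    return G


def is_metzler(G):
    """True if every off-diagonal entry is non-negative."""
    n = len(G)
    return all(G[i][j] >= 0.0 for i in range(n) for j in range(n) if i != j)


def is_column_dominant(G, tol=1e-12):
    """True if |g_ii| >= sum over j != i of g_ji for every column i."""
    n = len(G)
    return all(
        abs(G[i][i]) + tol >= sum(G[j][i] for j in range(n) if j != i)
        for i in range(n)
    )


def simulate(a, mu, rho, theta, t_end, dt):
    """Integrate x' = G(x) x + theta with forward Euler, starting from empty buffers."""
    n = len(mu)
    x = [0.0] * n
    for _ in range(round(t_end / dt)):
        G = build_G(x, a, mu, rho)
        dx = [sum(G[i][j] * x[j] for j in range(n)) + theta[i] for i in range(n)]
        x = [x[i] + dt * dx[i] for i in range(n)]
    return x


# Hypothetical 3-switch chain s0 -> s1 -> s2; s2 delivers to its local core.
A = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
MU = [100.0, 100.0, 100.0]   # service rates in flits/s (assumed values)
RHO = 1.0                    # shape parameter rho > 0 (assumed value)
THETA = [40.0, 0.0, 0.0]     # 40 flits/s injected at s0 only

x_final = simulate(A, MU, RHO, THETA, t_end=10.0, dt=0.001)
G_final = build_G(x_final, A, MU, RHO)
x_star = RHO * THETA[0] / (MU[0] - THETA[0])   # solves THETA[0] = r(x) for one switch
print([round(v, 4) for v in x_final], round(x_star, 4))
print(is_metzler(G_final), is_column_dominant(G_final))
```

In this chain every switch carries the same 40 flits/s in steady state, so each occupancy settles at the value $x^* = \rho\lambda/(\mu - \lambda)$ obtained by solving $\lambda = r(x^*)$; this closed form is a consequence of the assumed rate function, not a result reported in the paper.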

IV. EVALUATION STUDY

In this section, we show the practical use of the compartmental Fluid-flow modeling approach to assign the buffer space of each input channel. Results are computed with both analytical evaluation and simulation. In particular, we analyze the maximum buffer size needed to store packets waiting or under processing. For this evaluation, we consider a 4x4 2D Mesh on-chip interconnect as a case study (see Figure 3). The application is represented as communicating parallel processes already mapped onto the cores. Cores selected to be traffic sources are $c_1$, $c_2$, $c_3$, $c_4$, $c_5$, $c_9$, and $c_{13}$, and the cores selected to be sinks are $c_4$, $c_8$, $c_{12}$, $c_{13}$, $c_{14}$, $c_{15}$, and $c_{16}$. Data flows are computed using a deterministic routing protocol to direct flits between sources and sinks selected as follows: ($c_1$, $c_4$), ($c_1$, $c_{13}$), ($c_2$, $c_{14}$), ($c_3$, $c_{15}$), ($c_4$, $c_{16}$), ($c_5$, $c_8$), ($c_9$, $c_{12}$), ($c_{13}$, $c_{16}$). As shown in Figure 3, seven data flows are computed, e.g., $f_1 = (c_1, s_1, s_5, s_9, s_{13}, c_{13})$, $f_2 = (c_2, s_2, s_6, s_{10}, s_{14}, c_{14})$. After defining the data flows and the nodes participating in transmitting and/or receiving data (see Figure 3), the Fluid-flow model can be described by interconnecting all arrival and output flows. For example, $\dot{x}_2 = \lambda_2 + (1 - a_1)\, r_1(x_1) - (1 - a_2)\, r_2(x_2) - a_2\, r_2(x_2)$, where $\lambda_i$ is the injection rate at source core $c_i$ and $a_i = 0.5$ $\forall i$. It is worth noting that, in a 2D Mesh, there is only one core linked to each switch, whereas in Fat-Tree or Butterfly-Fat-Tree on-chip interconnects, the leaf switches have more than one core and the other switches are connected only to other switches.

To show the efficiency of this approach, we compared the analytical results with a detailed simulation of the system using the same traffic pattern, i.e., a specified target application. A discrete-event simulator, presented in [32] and developed using OMNeT++ [33], is used in this evaluation study.
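As an illustration of how the hop sequences of the data flows listed above can be derived, the sketch below routes each (source, sink) pair over a 4x4 mesh. Two assumptions are made that the text does not state: switches are numbered row by row starting from 1 (which is consistent with $f_1$ and $f_2$ as given), and the deterministic protocol is dimension-order (XY) routing. All listed pairs share a row or a column, so the choice of dimension order does not affect their paths. The function names are hypothetical.

```python
MESH_WIDTH = 4  # 4x4 2D Mesh, switches numbered 1..16 row by row (assumed)


def position(node):
    """Return the zero-based (row, col) of switch number `node`."""
    return divmod(node - 1, MESH_WIDTH)


def node_id(row, col):
    """Return the switch number at zero-based (row, col)."""
    return row * MESH_WIDTH + col + 1


def xy_route(src, dst):
    """Switches visited from src to dst: first along the row, then along the column."""
    row, col = position(src)
    dst_row, dst_col = position(dst)
    path = [src]
    while col != dst_col:
        col += 1 if dst_col > col else -1
        path.append(node_id(row, col))
    while row != dst_row:
        row += 1 if dst_row > row else -1
        path.append(node_id(row, col))
    return path


# (source core, sink core) pairs taken from the case study.
FLOW_PAIRS = [(1, 4), (1, 13), (2, 14), (3, 15), (4, 16), (5, 8), (9, 12), (13, 16)]


def flows_per_switch(pairs):
    """Count how many flows traverse each switch, endpoints included."""
    load = {s: 0 for s in range(1, MESH_WIDTH * MESH_WIDTH + 1)}
    for src, dst in pairs:
        for s in xy_route(src, dst):
            load[s] += 1
    return load


for src, dst in FLOW_PAIRS:
    print(f"c{src} -> c{dst}: switches {xy_route(src, dst)}")
print(flows_per_switch(FLOW_PAIRS))
```

The per-switch flow count indicates which switches receive traffic from several flows and therefore which state equations contain more than one input term.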
The link bandwidth $R$ is configured to be 200 flits/s because of the amount of resources required for each simulation instance. Each process is linked with a traffic generator that injects flits according to a memoryless distribution (Poisson sources) with mean inter-arrival times of 12.5 ms, 16.6 ms, 25 ms, and 50 ms, which are equivalent to injection rates of 80 flits/s, 60 flits/s, 40 flits/s, and 20 flits/s respectively. Buffer sizes inside cores are not computed in this evaluation; we assume that cores have enough space and speed to handle and process the arriving data. In the switches, each port has a buffer for temporary storage of received packets. The simulation time is fixed to 10 s. Figures 4, 5, 6, and 7 show the buffer size variation when the injection rates are fixed to 20, 40, 60, and 80 flits/s respectively. The results depicted in these figures demonstrate that, as the injection rate increases, more buffer space is needed to prevent flits from being dropped: the network becomes more congested with heavy traffic, and so more space is needed to absorb differences in switch processing rates and traffic burstiness. We can also see that the analytical results are of the same order of magnitude as those obtained by simulation, i.e., the simulation results agree with the analytical model, with a deviation of less than 10% on average.

Fig. 3. Data flows used in the compartmental Fluid-flow model of the 4x4 2D Mesh on-chip interconnect

Fig. 4. Buffer size variation over time computed with simulation and analysis when the injection rate is 20 flits/s

Fig. 5. Buffer size variation over time computed with simulation and analysis when the injection rate is 40 flits/s

Fig. 6. Buffer size variation over time computed with simulation and analysis when the injection rate is 60 flits/s

Fig. 7. Buffer size variation over time computed with simulation and analysis when the injection rate is 80 flits/s
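The Poisson sources used in the simulation can be mimicked with a few lines of code. The sketch below is not the traffic generator of the simulator used in this study; the function name and the random seed are hypothetical. It draws exponentially distributed gaps whose mean is the reciprocal of the injection rate, so that a rate of 80 flits/s corresponds to a mean gap of 12.5 ms, and it stops at the 10 s simulation horizon.

```python
import random


def poisson_injection_times(rate, duration, rng):
    """Injection instants (in seconds) of a Poisson source.

    rate     -- injection rate in flits/s (mean gap between flits is 1 / rate)
    duration -- length of the observation window in seconds
    rng      -- a random.Random instance, so that runs are reproducible
    """
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times


rng = random.Random(1)  # arbitrary seed, chosen only for reproducibility
for rate in (20, 40, 60, 80):
    times = poisson_injection_times(rate, 10.0, rng)
    print(f"{rate} flits/s: mean gap {1000 / rate:.1f} ms, {len(times)} flits in 10 s")
```

The number of flits generated in 10 s fluctuates around `rate * 10` from run to run, which is the burstiness that the switch buffers have to absorb.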

We have also compared the Network Calculus model with the Fluid-flow model and simulations under different injection rates. After defining the data flows and the nodes participating in transmitting and/or receiving data, the entire network is described to obtain the Network Calculus-based model by merging all arrival and output flows. The results depicted in Figure 8 show that the Fluid-flow model provides a more accurate estimation of buffer sizes than Network Calculus, which provides a worst-case analysis (upper bounds).

Fig. 8. Maximum buffer size evaluated using the Fluid-flow model, Network Calculus, and simulation
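To illustrate why a Network Calculus estimate is an upper bound, the sketch below evaluates the classical backlog bound for a single node. The curve shapes are an assumption, since the curves used for Figure 8 are not given here: the flow is constrained by a token-bucket arrival curve $\alpha(t) = b + rt$ and the switch offers a rate-latency service curve $\beta(t) = R\,\max(0, t - T)$. For $r \le R$ the worst-case backlog is then $b + rT$. The burst and latency values are hypothetical; only the 200 flits/s link rate and the injection rates come from the case study.

```python
def backlog_bound(burst, rate, service_rate, latency):
    """Worst-case backlog of a token-bucket flow at a rate-latency server.

    arrival curve:  alpha(t) = burst + rate * t
    service curve:  beta(t)  = service_rate * max(0, t - latency)
    The bound burst + rate * latency holds only when rate <= service_rate.
    """
    if rate > service_rate:
        raise ValueError("backlog is unbounded when rate exceeds service_rate")
    return burst + rate * latency


BURST = 4.0        # flits, hypothetical burst size
LATENCY = 0.01     # seconds, hypothetical service latency
LINK_RATE = 200.0  # flits/s, link bandwidth R of the case study

for rate in (20.0, 40.0, 60.0, 80.0):
    bound = backlog_bound(BURST, rate, LINK_RATE, LATENCY)
    print(f"{rate:.0f} flits/s -> worst-case backlog {bound:.2f} flits")
```

Because the bound must hold for every arrival pattern allowed by the arrival curve, it is never smaller than the occupancy observed for one particular traffic trace.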

V. CONCLUSIONS AND FUTURE WORK

In this paper, a design space exploration methodology based on a compartmental Fluid-flow model is introduced for buffer space allocation in NoC router design. The objective is to allow designers, in the early stages of the design process, to rapidly analyze and allocate the required buffer space for each channel. By analyzing and capturing the characteristics of on-chip communication traffic, the designer can select and design on-chip interconnect routers that are optimized for a target application. The objective of this work is first to build a framework based on compartmental Fluid-flow theory for application-specific NoCs. Although this study considered a small NoC, ongoing work addresses the scalability issue for large-scale NoCs, including applications with complex data flows (e.g., uniform, locality, and hotspot traffic), and compares the proposed technique with other buffer sizing approaches from the literature. Because available resources must be shared between multiple connections, insufficient resources may create congestion or bottlenecks in some switches, leading to poor performance. The compartmental Fluid-flow modeling framework will be used to analyze and avoid congestion by including control mechanisms that guarantee the boundedness of the buffer sizes defined at design time.

REFERENCES

[1] R. Hofmann and B. Drerup, “Next generation coreconnect processor local bus architecture,” IEEE ASIC/SOC Proc., pp. 221–225, 2002.
[2] D. Wingard, “Micronetwork-based integration for socs,” DAC Proc., pp. 673–677, 2001.
[3] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Öberg, K. Tiensyrjä, and A. Hemani, “A network on chip architecture and design methodology,” Proc. Int’l Symp. VLSI (ISVLSI), pp. 117–124, 2002.
[4] U. Y. Ogras, J. Hu, and R. Marculescu, “Key research problems in noc design: A holistic perspective,” CODES+ISSS Proc., 2005.
[5] S. Suboh, M. Bakhouya, J. Gaber, and T. El-Ghazawi, “An interconnection architecture for network-on-chip systems,” Telecom. Systems, vol. 37, no. 1-3, pp. 137–144, 2008.
[6] L. Benini and G. D. Micheli, “Networks on chips: A new soc paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, 2002.
[7] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, “Performance evaluation and design tradeoffs for network-on-chip interconnect architectures,” IEEE Trans. on Computer, vol. 54, no. 8, pp. 1025–1040, 2005.
[8] A. Balkan, G. Qu, and U. Vishkin, “A mesh-of-trees interconnection network for single-chip parallel processing,” ASAP Proc., 2006.
[9] M. Coenen, S. Murali, A. Rădulescu, K. Goossens, and G. D. Micheli, “A buffer-sizing algorithm for networks on chip using tdma and credit-based end-to-end flow control,” CODES+ISSS Proc., 2006.
[10] J. Hu, U. Y. Ogras, and R. Marculescu, “System-level buffer allocation for application-specific networks-on-chip router design,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, pp. 2919–2933, 2006.
[11] I. Saastamoinen and J. N. M. Alh, “Buffer implementation for proteo networks-on-chip,” Proc. Int. Symp. Circuits and Syst., pp. 113–116, 2003.
[12] M. Bakhouya, “Evaluating the energy consumption and the silicon area of on-chip interconnect architectures,” Journal of Systems Architecture, vol. 55, no. 7-9, pp. 387–395, 2009.
[13] S. Suboh, M. Bakhouya, S. Lopez-Buedo, and T. El-Ghazawi, “Simulation-based approach for evaluating on-chip interconnect architectures,” SPL Proc., pp. 75–80, 2008.
[14] J. Hu and R. Marculescu, “Application-specific buffer space allocation for networks-on-chip router design,” Proceedings of the 2004 IEEE/ACM International Conference on Computer-Aided Design, pp. 354–361, 2004.
[15] U. Y. Ogras and R. Marculescu, “Analytical router modeling for networks-on-chip performance analysis,” DATE Proc., pp. 1–6, 2007.
[16] L. Wang, Y. Cao, X. Li, and X. Zhu, “Application specific buffer allocation for wormhole routing networks-on-chip,” NoCarc08, MICRO41, 2008.
[17] K. Lahiri, S. Dey, and A. Raghunathan, “Evaluation of the traffic-performance characteristics of system-on-chip communication architectures,” VLSI Design Proc., p. 29, 2001.
[18] V. Guffens, G. Bastin, and H. Mounier, “Fluid flow network modeling for hop-by-hop feedback control design and analysis,” Proceedings Internetworking, 2003.
[19] J. Jacquez and C. Simon, “Qualitative theory of compartmental systems,” SIAM Review, vol. 35, no. 1, pp. 43–79, 1993.
[20] J. Jacquez and C. Simon, “Qualitative theory of compartmental systems with lags,” Mathematical Biosciences, vol. 180, pp. 329–362, 2002.
[21] G. Bastin, “Sur la modélisation et le contrôle des réseaux dynamiques conservatifs,” Revue E-STA, Special CIFA 2006, vol. 3, no. 2, 2007.
[22] A. Hansson, M. Wiggers, A. Moonen, K. Goossens, and M. Bekooij, “Applying dataflow analysis to dimension buffers for guaranteed performance in networks on chip,” NOCS Proc., pp. 211–212, 2008.
[23] G. Varatkar and R. Marculescu, “Traffic analysis for on-chip networks design of multimedia applications,” DAC Proc., pp. 510–517, 2002.
[24] R. Marculescu and P. Bogdan, “The chip is the network: Toward a science of network-on-chip design,” Foundations and Trends in Electronic Design Automation, vol. 2, no. 4, pp. 371–461, 2007.
[25] P. Bogdan and R. Marculescu, “Quantum-like effects in network-on-chip buffers behavior,” in Proceedings of the 44th Design Automation Conference (DAC), pp. 266–267, 2007.
[26] J.-Y. L. Boudec and P. Thiran, “Network calculus: A theory of deterministic queuing systems for the internet,” LNCS 2050, 2001.
[27] R. L. Cruz, “A calculus for network delay, part ii: Network analysis,” IEEE Trans. on Information Theory, vol. 37, no. 1, pp. 132–141, 1991.
[28] M. Bakhouya, S. Suboh, J. Gaber, and T. El-Ghazawi, “Analytical modeling and evaluation of on-chip interconnects using network calculus,” NoCS Proc., pp. 74–79, 2009.
[29] V. K. Sehgal and D. S. Chauhan, “State observer controller design for packets flow control in networks-on-chip,” Journal of Supercomputing, DOI 10.1007/s11227-009-0322-5, 2009.
[30] A. Pitsillides, P. Ioannou, M. Lestas, and L. Rossides, “Adaptive nonlinear congestion controller for a differentiated-services framework,” IEEE/ACM Transactions on Networking, vol. 13, no. 1, pp. 94–107, 2005.
[31] D. Tipper and M. K. Sundareshan, “Numerical methods for modeling computer networks under nonstationary conditions,” IEEE Journal on Selected Areas in Communications, vol. 8, no. 9, 1990.
[32] V. Guffens, “Compartmental fluid-flow modelling in packet switched networks with hop-by-hop control,” PhD thesis, 2005.
[33] A. Varga, “Omnet++ simulator,” available at http://www.omnetpp.org/.