A Low-overhead Dedicated Execution Support for Stream Applications on Shared-memory CMP

Paul Dubrulle, Stéphane Louise, Renaud Sirdey, Vincent David
CEA, LIST, Point Courrier 172, 91191 Gif-sur-Yvette Cedex, France
@cea.fr

ABSTRACT
The ever-growing number of cores in Chip Multi-Processors (CMP) brings a renewed interest in stream programming to solve the programmability issues raised by massively parallel architectures. Stream programming languages are flourishing (StreaMIT, Brook, ΣC, etc.). Nonetheless, their execution supports have not yet received enough attention, in particular regarding the new generation of many-cores. In embedded software, a lightweight solution can be implemented as a specialized library, but a dedicated micro-kernel offers a more flexible solution. We propose to explore the latter with a Logical Vector Time based execution model, for CMP architectures with on-chip shared memory.

Categories and Subject Descriptors
D.4.1 [Operating Systems]: Process Management—scheduling, synchronization; D.1.3 [Programming Techniques]: Concurrent Programming—parallel programming

General Terms
Performance, Reliability, Algorithms, Theory, Experimentation, Measurement.

Keywords
Execution model, logical vector time, micro-kernel, chip multiprocessor, manycore, stream programming.

1. INTRODUCTION

1.1 Motivations

With the advent of multi-core processors as a pervasive reality, from supercomputing to embedded systems, programming concepts must also evolve toward new paradigms, as massive parallelism in programs becomes mandatory and imperative programming no longer fits this new era well. Several concepts of parallel programming are competing for the status of most adapted language paradigm for many-core systems. Among them, a number of stream programming languages are emerging (StreaMIT [24], Brook, ΣC [12], . . . ). The advantages of stream programming rely in part on its theoretical bases, which make programs amenable to formal verification of important application properties such as deadlock freeness, execution within limited memory bounds, or correctness of parallel applications, including functional determinism and absence of race conditions [18]. Even though stream programming does not fit all kinds of computation, it is very well suited to signal and image processing applications, which are among the killer applications for many-core systems.
The bases of stream programming rely on Kahn Process Networks (KPN [15]), more precisely on their derivation, Dataflow Process Networks [18], as well as on more restrictive variants such as Cyclo-Static Data Flow (CSDF [10]). KPN and CSDF are deterministic, and the possibility to run a CSDF in bounded memory is a decidable problem [4].
Most current implementations of execution supports for stream programming languages are threading libraries above off-the-shelf OS. This is a correct approach as a validation tool, or for running streaming applications on full-size computers. However, in the embedded world, a lighter approach is needed, as memory is usually a scarce resource, and embedded targets require special attention to power consumption and code performance. A specialized micro-kernel would potentially provide a very light execution support with little overhead, along with more options for scheduling policies. It is an adequate answer for stream applications on multi-core and many-core embedded platforms, as it provides a strict control over hardware resources and task execution, a necessity to achieve a negligible CPU and memory overhead.

1.2 Target architectures

The choice of the target architecture is of prime importance in the design of a specialized micro-kernel. We chose to target multi-core systems with shared on-chip memory. This memory can either be a specialized local storage shared among several cores, or a shared L2 or L3 on-chip cache. A specialized micro-kernel targeting such architectures scales to embedded many-core chips, most of which are clustered (hence one instance of the micro-kernel runs on each cluster, which typically is a small SMP, cf. Figure 1). When this is not the case, the set of cores can be partitioned.





Figure 1: A multi-core micro-kernel is scalable to many-core by partitioning the many-core and executing an instance of the micro-kernel per partition; the multi-core partitions P0 to Pj each host an instance of the micro-kernel, and the double arrow represents the need for communication between partitions (e.g. the partitions could be clusters and the double arrow a NoC, for a hierarchical many-core target).

These architectures are well fitted to stream programming paradigms because channels between processes can be implemented as shared memory buffers (with low latencies) between processes running in parallel. If buffer sizes are correctly adjusted, the producing process of a channel can run in parallel with the consuming process(es), because the latter can work on data in a section already written by the producer at the previous execution step. Moreover, such architectures allow for a global and dynamic scheduling of the tasks, instead of the static scheduling often met nowadays: a shared task list in on-chip memory theoretically allows for a dynamic execution which can absorb variations in single task execution times, hence taking advantage of the early release of processing cores/units to improve the average execution time of a multitasking application, its typical latency, or both.

1.3 Goals

We aim at providing a road-map from a stream programming language to an execution model on a multi-core with shared on-chip memory, under the following constraints:

1. a lightweight implementation of a specific micro-kernel, for the reasons given in subsection 1.1;
2. as low an overhead as possible, to minimize the impact of the runtime on user computation;
3. constrained dynamic scheduling, because of its advantages on the chosen target platforms, as explained in subsection 1.2.

For us the second constraint is nearly as important as the first; this is why we chose to focus on an asymmetric implementation of a multi-core micro-kernel. One execution resource is dedicated to task management, while inter-process communication and computation are left to the others.

2. RELATED WORKS

2.1 StreaMIT execution support

StreaMIT provides a support for the RAW multi-core architecture, as described in [2]. In this work, communications are statically determined between cores, and tasks running on the same core are fused together by several compilation techniques (parallel fusion for split-join parallelism and vertical fusion for successive filters). This provides an overall static scheduling of the parallel application on the RAW multi-core target.
This work is the closest to what we aim to do with multi-core architectures. Nonetheless, the RAW architecture is quite different from our main target architecture, since it does not provide a memory shared by all the cores. Moreover, the paper provides mostly performance benchmarks for applications on RAW, making it difficult to compare with our approach on task switch latencies: on one hand, the fusion and static scheduling mean that there is no real task switch on a given core; on the other hand, benchmark results are also hard to compare because the RAW architecture is not comparable to ours (communication through shared memory vs. communication through core-specific data channels, etc.). In any case, a static scheduling as implemented in the StreaMIT support for RAW is based on measures of user task execution times. If these execution times vary a lot during execution (in dynamic applications such as video encoders, for example), the static scheduling is much less efficient. A dynamic scheduling, as in our approach, takes advantage of the early release of processing cores.
Another paper [29] presents an implementation of an execution layer for StreaMIT on the Cell BE processor. The Cell BE is a heterogeneous platform with one front-end dual-threaded PPC core (with L1 and L2 caches) and 8 SPE cores (mostly vector processors) with 256 KB of per-core local on-chip memory (local store). All the cores are connected through a double-ring high-speed NoC. The Multicore Streaming Layer described in this paper is close in its principles to ours. The aim is to have an execution abstraction layer valid both for shared-memory and local-memory multi-cores. Nonetheless, the implementation is done for a local-memory multi-core (Cell BE). Though the implementation we present in this paper is for shared memory, it is ready to be extended to the local memory paradigm (cf. subsection 1.2) by using double buffers to manage communication transparently between on-chip local memories: the basic principle is to replace the shared-memory buffer with a source buffer and as many copy buffers as required by the application (depending on task clustering and assignment to cores or groups of cores). The main difference with our approach is that Zhang's approach provides a framework for dynamically running tasks and managing buffers (sending data to dependent tasks when it is ready). Our approach is more static, and allows for off-line verifications, whereas their approach depends on the correct firing of commands (or groups of commands) to the runtime library, especially in the dynamic scheduling scheme. Since this framework relies on a Linux OS for Cell BE and is not a bare-bone implementation, the filter run command is said to take a few hundred microseconds, which is large compared to our task selection/commutation time. The benchmarks are, as in the previous case, hard to compare because the targeted architecture and software approach are quite different.


2.2 Array-OL

Array-OL is a Domain Specific Language which, by its constructs, can easily be parented or mapped to stream processing models of execution. Two works were specifically proposed toward this goal. The first one mapped the language constructs to Kahn Process Network (KPN) constructs [1]. It shows that Array-OL paradigms are easily projected onto KPN, using arrays as basic tokens. The KPN were then implemented with pipes on top of a CORBA framework. The targeted systems are heterogeneous computing farms (networks of workstations) and Multi-Processor Systems on Chip (MPSoC). Experiments were only conducted for the former target, so comparisons with our work are hard to make (networks of workstations are not shared memory systems). Moreover, performance should heavily depend on the CORBA implementation, and no details are provided regarding how such an implementation should be achieved for MPSoC. The second one applied a closely related technique with Multidimensional Synchronous Dataflows (MSDF) [7]. The model mapping is very close to the previous one. Nonetheless, the evaluations are only made on the Ptolemy platform [17], which is a good target for research work but does not provide real ground for comparison with a specialized execution layer.

2.3 Brook

The Brook language [3] is an extension of standard C that provides notions of stream types. Streams are processed by kernels (i.e. specialized functions). Specialized functions and kernels provide the tools for splitting, joining or decimating stream values (hence the communication framework). Brook relies on a source-to-source compiler to generate e.g. code for GPGPU targets. It uses a generic runtime called the Brook RunTime library (BRT) as execution support. Its main aim is to facilitate the transformation of legacy C code to exploit the power of GPUs, by accelerating the execution of compute-intensive parts of the original code on the GPU. Although one of the long-term goals of Brook is to support embedded multi-core targets like PicoChip or RAW, the current environment is focused on general purpose workstations with GPU accelerators. Only the compute-intensive kernels are accelerated, and even the communication parts (scatter and gather especially) are mapped on the CPU, since current GPUs are not well fitted to these kinds of processing. Therefore the comparison with our work is hard to achieve, even on the pure stream parts of a Brook application. Several other extensions of the C or C++ language families have been proposed to provide a means to program GPUs; one can cite OpenCL, CUDA or Cg. They are proposed for vertex shader or GPGPU kernel programming, but they remain low-level compared to stream programming paradigms.

2.4 Vector clocks

The concept of vector clocks has been indirectly used before in many application fields (file consistency [22], distributed debugging [8], distributed mutual exclusion [23], . . . ). Yet the capability of vector clocks to capture causal relations between events has never been used for dynamic scheduling. It is quite well adapted to it, because it allows an efficient, deterministic and dynamic task status update at runtime, as exposed in the technical part of this paper in subsection 5.2. A major drawback, though, is the space such vectors use in memory, especially in embedded environments, but this problem is also addressed in the implementation of our prototype (cf. subsection 5.3). Using the LVT execution model as a library over a POSIX system was our validation approach, but the present micro-kernel approach has always been our goal. A specialized micro-kernel uses much fewer resources (memory, CPU time) than a general purpose kernel, which is critical on the embedded platforms we target. It also avoids interactions between the user space scheduler (LVT based) and the host system's scheduling policy, which may decide a preemption without considering the impact on the application latency.

3. STREAM PROGRAMMING

Before describing our execution model and micro-kernel, we describe the stream programming paradigm and the task model it supports. The stream programming paradigm is based on two elements. First, a set of filters, which are computing units that take values as input on specified read-only channels, use these values for processing, and output values (the computation results) on pre-declared write-mostly channels. Reading on the input channels of filters is blocking. Output channels are theoretically not limited in size, but of course a desired property of the system is that it is amenable to run in finite memory. Second, a communication graph, which links either output channels or sources to either input channels or sinks. The communication graph can be quite complex, expressing data access patterns including but not limited to permutations, with possible duplication or decimation (without any change to the transferred data). Among the desired properties of the system, the stream paradigm eliminates race conditions by construction, and the presence of deadlocks can be detected at compile time. One possible restriction for stream programs is to conform to the Cyclo-Static Data Flow (CSDF) model [10], shortly introduced in subsection 4.2. Actually, the programming language ΣC [12] defines a superset of CSDF which remains decidable while allowing data dependent control to a certain extent. Nonetheless, as the focus of this paper is not on the programming aspects, let us suppose that the model is limited to CSDF.
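To make the notion of filter concrete, here is a minimal sketch, in plain C, of what one activation of a hypothetical averaging filter could look like once its channels have been mapped to arrays by the compiler; the names (average_activation, WINDOW) are illustrative and not part of any actual ΣC or StreaMIT API.

#include <stddef.h>

#define WINDOW 4  /* illustrative intake size per activation */

/* One activation of a hypothetical averaging filter: it reads
 * WINDOW tokens from its (read-only) input channel and produces
 * one token on its (write-mostly) output channel; the runtime
 * guarantees that in[] holds enough data before this runs. */
static void average_activation(const float in[WINDOW], float out[1])
{
    float acc = 0.0f;
    for (size_t i = 0; i < WINDOW; i++)
        acc += in[i];
    out[0] = acc / (float)WINDOW;
}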

4. CONTRIBUTIONS

We propose a micro-kernel architecture for the support of stream applications on embedded multi-cores, based on a Logical Vector Time (LVT) execution model. We developed a prototype for this micro-kernel and provide a first evaluation on a set of stream processing applications. To the best of our knowledge, there is no other micro-kernel approach to the execution support of stream applications, and the work we present here shows it is a safe and efficient approach.

4.1 Communication graph

The usual elements of communication graphs are simple channels (possibly with distinct production and consumption rhythms), splitters (including duplicators or dispatchers), joiners, and feed-back loops. Filters can have the same semantics as some graph elements (e.g. splitters) if defined accordingly. But one of the main advantages of expressing them in the communication graph is that properties such as deadlock freeness and execution in bounded memory can be automatically checked (with automatic sizing of the exchange buffers). Moreover, transformations can be inferred to produce an equivalent graph that would e.g. reduce the memory requirements or improve the throughput, as in [20], or as done in StreaMIT [24].




Figure 2: An example of a simple CSDF communication graph, with three pipelined filters t, u and v (the corresponding dependency graph is given in Figure 3); the production sequences on channels are presented on the right of the edges, the intake sequences on the left.


4.2 Filters

In this paper, we limit the task model to CSDF. In order to distinguish the task as seen in the CSDF model (hence in the programming model) from the actual task that can run on a multi-core target system (hence in the execution model), the word filter is used here to refer to the CSDF tasks. In the CSDF model, a filter generally has several input channels and several output channels. The numbers of data tokens produced (resp. consumed) on a channel are set at compile time, and may change from one activation of the filter to the next, following a static cycle. For example, if $c$ is an input channel of a given filter, with an intake cycle defined by a sequence of $n_c$ positive integers $x^c_j$, then the number of data tokens consumed on channel $c$ at cycle $i$ is $x^c_k$ with $k = i \bmod n_c$ (cf. Figure 2). Filters are in general stateful. They can perform any kind of operation on the data read from their input channels and forward the results to their output channels. Nonetheless, they are not authorized to use shared variables, and the channels are the only way a filter can communicate values to another (this can be formally enforced by an adequate link edition process).
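As an illustration, the following C fragment is a minimal sketch (with hypothetical names of ours) of how a runtime could look up the intake rate of a channel for a given activation, directly applying the $k = i \bmod n_c$ rule above.

#include <stddef.h>

/* Static intake cycle of one input channel, fixed at compile time:
 * rates[j] is the number of tokens consumed at phase j. */
struct intake_cycle {
    size_t length;          /* n_c: number of phases in the cycle */
    const unsigned *rates;  /* x^c_0 .. x^c_{n_c - 1}             */
};

/* Number of tokens consumed on this channel at the i-th activation. */
static unsigned tokens_consumed(const struct intake_cycle *c, size_t i)
{
    return c->rates[i % c->length];   /* k = i mod n_c */
}

For the pipeline of Figure 2, u consumes one token per activation (intake cycle {1}) while v consumes two (intake cycle {2}), so both channels have single-phase cycles; only the production cycle of t, {2, 1}, has two phases.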


Figure 3: An example of dependency graph produced for the pipeline of Figure 2; node labels represent the owner filter and the rank of the corresponding activation, and the dashed arcs represent dependencies for a possible buffer dimensioning.


Filters and tasks. Tasks are the entities that actually run on a given platform. They differ from filters in the sense that their communication channels have been mapped to buffers in memory (on-chip memory if available). Moreover, several filters may have been merged together to form one single task; in particular, when complex communication patterns are used (as expressed by compositions of splitters, joiners, etc.), they are often merged with filters. This is done as part of the compilation optimization process, as in [5].

4.3 Channels and communication buffers

The goal of the compilation is to transform the communication through channels into efficient communication through shared memory-mapped circular buffers. The compilation tools are used to ensure that the application liveness property is fulfilled, provided that task scheduling is correct. Mapping channels onto circular buffers is a first step to provide a pointer equivalence (at the first level), so that application code can access data as ordinary arrays in C without having to compute complex index values for each access (this should hold true for sliding windows as well). Moreover, as only one filter is authorized to write a given value, while one or many filters can read it, flushing/invalidating the L1 caches at design points can ensure memory coherency. This is one of the main advantages of stream programming for embedded targets, where cache coherency mechanisms are often lacking because of their silicon surface and power consumption.
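The following minimal sketch illustrates the idea of first-level pointer equivalence over a circular buffer, under the simplifying assumption (ours, not the paper's) that the buffer capacity is a whole number of access windows, so a window never wraps and the task can use the returned pointer as a plain C array.

#include <stddef.h>

/* A channel mapped onto a circular buffer in shared memory; the
 * capacity is a multiple of the window size, so a window of
 * `window` tokens never straddles the wrap-around point. */
struct channel_buffer {
    float  *base;      /* start of the shared-memory region    */
    size_t  capacity;  /* total number of tokens in the buffer */
    size_t  window;    /* tokens accessed per activation       */
    size_t  pos;       /* current alias offset, in tokens      */
};

/* Returns a pointer the task can use as an ordinary array for this
 * activation, then advances the alias for the next activation. */
static float *acquire_window(struct channel_buffer *b)
{
    float *alias = b->base + b->pos;
    b->pos = (b->pos + b->window) % b->capacity;  /* modulo advance */
    return alias;
}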

4.4 Extracting task dependencies

As said throughout this section, and especially in subsection 4.1, the correct execution of the system relies on a correct partial order of execution of the compiled tasks, so that the circular buffers become a simulation of the channels of the communication graph. We said in 4.2 that filters have a cyclic communication behavior. The compilation process ensures that tasks (a compiled filter or set of filters) also have such a cyclic behavior. For a compiled task, an activation is an event in the task's lifetime corresponding to its activation to produce and/or consume data, as defined in its cyclic communication state machine. The partial order is created from the constraints that a given task should not start an activation before its input buffers hold enough data (i.e. the task producing data in the buffer completed the producing activation), and before its output buffers have enough free space to prevent an overwrite of data still being consumed by other tasks (i.e. all the consuming tasks have completed enough consuming activations). The task activations that are not comparable with respect to this partial order can be executed in parallel. The other activations must wait until all preceding activations are completed. In the compilation process, we can encode the partial order in the form of a Directed Acyclic Graph (DAG), called the dependency graph (cf. Figure 3), denoted $G(V, A)$. A node $e^i_t \in V$ represents the $i$-th activation of task $t$, and an arc $(e^i_t, e^j_u) \in A$ tells that $e^j_u$ depends on $e^i_t$ (denoted $e^i_t \to e^j_u$). The subset of activations of one task $t$ in the dependency graph is denoted $V_t$. The dependency graph is a finite object, which gives the dependencies between activations of tasks such that, when all activations present in the graph are executed, the buffers return to their initial state (all produced data was consumed). It means that the application behavior described in the graph can be replicated as many times as needed in the analysis process.



5. LOGICAL VECTOR TIME BASED EXECUTION MODEL

The dependencies between task activations in the dependency graph can be used at runtime to decide whether a task is ready or not. To extract and use this information from the graph efficiently, we propose an execution model based on Logical Vector Time (LVT), providing deterministic and safe execution of stream programs. It is based on assigning a vector clock to each task in the set $T$ of all compiled tasks of the application. The clock is updated and compared at runtime to decide whether the current activations of two tasks are ordered or causally independent. To achieve this, we need to compute off-line a partially ordered set (poset) of vector time-stamps, isomorphic to the partial order defined by the dependency graph. Further analysis of the vectors produced allows us to find a time-stamp defining an initial clock value for each task, and a finite set of vector increments to infinitely update this value.

5.1 Generating vector clocks

We first summarize the elements of partially ordered set theory [6] necessary to describe our execution model, and the way to obtain the runtime data structures necessary to execute the application. Let $X$ be a set of elements. A partial order on $X$ is a relation, denoted $\sqsubseteq$, that is reflexive ($x \sqsubseteq x, \forall x \in X$), antisymmetric ($x \sqsubseteq y \wedge y \sqsubseteq x \Leftrightarrow x = y, \forall x, y \in X$), and transitive ($x \sqsubseteq y \wedge y \sqsubseteq z \Rightarrow x \sqsubseteq z, \forall x, y, z \in X$). The pair $(X, \sqsubseteq)$ defines a poset. We also define a relation $<$ as $x < y \Leftrightarrow x \sqsubseteq y \wedge x \neq y, \forall x, y \in X$. To build the poset of vector time-stamps $(H, \sqsubseteq)$, each activation $e$ in the dependency graph is assigned a unique vector $h(e)$, with $|T|$ coordinates, using Algorithm 1, a variant of a well-known algorithm [9, 19].

ALGORITHM 1: Computing vector time-stamps in the dependency graph
Input: dependency graph of the application $G(V, A)$
Output: set of vector time-stamps $H$
for each activation $e^i_t \in V$ do
  $\forall v,\ h(e^i_t)[v] = 0$
end
for each activation $e^i_t \in V$, in topological order do
  $h(e^i_t)[t] = i$;
  for each direct successor $e^j_u$ of $e^i_t$ do
    $\forall v,\ h(e^j_u)[v] = \max(h(e^j_u)[v], h(e^i_t)[v])$;
  end
end
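A direct C transcription of Algorithm 1 could look like the following sketch; the graph representation (an array of nodes already in topological order, with successor lists) and the fixed-size vector type are assumptions made for the example.

#include <stddef.h>

#define NTASKS 3   /* |T|, e.g. the t, u, v pipeline of Figure 2 */

/* One activation node of the dependency graph. */
struct activation {
    int task;                  /* owner task index t          */
    int rank;                  /* activation rank i           */
    int nsucc;                 /* number of direct successors */
    struct activation **succ;  /* direct successors           */
    int h[NTASKS];             /* vector time-stamp h(e)      */
};

/* Algorithm 1: nodes[] must be topologically ordered. */
static void compute_timestamps(struct activation *nodes[], size_t n)
{
    for (size_t k = 0; k < n; k++)
        for (int v = 0; v < NTASKS; v++)
            nodes[k]->h[v] = 0;

    for (size_t k = 0; k < n; k++) {
        struct activation *e = nodes[k];
        e->h[e->task] = e->rank;
        for (int s = 0; s < e->nsucc; s++)     /* propagate the max */
            for (int v = 0; v < NTASKS; v++)
                if (e->h[v] > e->succ[s]->h[v])
                    e->succ[s]->h[v] = e->h[v];
    }
}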

This algorithm guarantees that a vector coordinate $h(e^i_t)[u]$ is monotonically increasing, and that $e^i_t \to e^j_u \Leftrightarrow h(e^i_t)[v] \le h(e^j_u)[v], \forall v$. This relation can be narrowed down to $e^i_t \to e^j_u \Leftrightarrow h(e^i_t)[t] \le h(e^j_u)[t]$ when $t$ and $u$ are known, as shown in [16]. Thus, given $H$ the set of the vector time-stamps, and given the relation $h(e^i_t) \sqsubseteq h(e^j_u) \Leftrightarrow h(e^i_t)[v] \le h(e^j_u)[v], \forall v$, we have a poset $(H, \sqsubseteq)$ isomorphic to the partial order defined by the dependency graph.
The cyclic property of task behavior implies that, in the dependency graph, task dependencies repeat themselves following the same pattern over and over. So, by transitivity, once a task has received at least one dependency from every other task in the computation, it will receive them cyclically for ever after. Considering the time-stamp of this initial activation and the task's cycle length, we can build a cyclic set of increment vectors. Incrementing the initial time-stamp using this finite cyclic increment set gives us a vector clock that can theoretically be incremented infinitely (see subsection 5.3 for details on how this is achieved). The algorithms that compute the initial clock values and the increment set have a quadratic complexity; thus it is not a problem to generate the runtime data structures even for thousands of tasks on an average workstation (less than 1 second for 1000 tasks on a workstation equipped with an Intel Core i5 CPU M520).

5.2 Using vector clocks for task activation

To each task $t \in T$, we associate the following runtime data:

1. a vector clock $d_t$, of dimension $|T|$, representing its current availability date;
2. an integer $k_t$ used as a dependency counter;
3. an incrementation rule, $\delta_t : \mathbb{Z}^{|T|} \to \mathbb{Z}^{|T|}$, used at the end of every activation of $t$ to update $d_t$.

A task with $k_t = 0$ is by construction ready for execution. A simple sequence of clock comparisons can initialize the dependency counter of a task $t$: $k_t = 0$; for each task $v$ in $T \setminus \{t\}$, if $d_v < d_t$ then $k_t = k_t + 1$. As the initial values of the vector clocks encode a partial order, there must be at least one task $t$ such that $k_t = 0$ (i.e. there is always at least one vertex without predecessor in the dependency DAG). At the end of every activation of a task $t$, applying Algorithm 2 updates the LVT and the task dependencies. It is important to mention that this algorithm can be parallelized, and that the set of tasks used in the update can be significantly reduced.
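As a minimal sketch (our own data layout, not the paper's), the per-task runtime data and the cyclic application of the incrementation rule $\delta_t$ could look as follows, reusing NTASKS from the previous sketch; in the real kernel the additions are performed modulo $M$ (cf. subsection 5.3).

/* Per-task runtime data of the LVT execution model. */
struct task_rt {
    int d[NTASKS];             /* vector clock d_t (availability date) */
    int k;                     /* dependency counter k_t               */
    int cycle_len;             /* length of the increment cycle        */
    int phase;                 /* current position in the cycle        */
    const int (*inc)[NTASKS];  /* cyclic increment vectors             */
};

/* delta_t: add the increment of the current phase to the clock,
 * then step to the next phase of the cycle. */
static void apply_increment(struct task_rt *t)
{
    for (int v = 0; v < NTASKS; v++)
        t->d[v] += t->inc[t->phase][v];
    t->phase = (t->phase + 1) % t->cycle_len;
}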

ALGORITHM 2: Updating task runtime data at the end of an activation
Input: set of tasks in computation $T$, task at end of activation $t \in T$
Output: updated clock and dependency counters
for each task $v \in T \setminus \{t\}$ do
  if $d_t < d_v$ then $k_v = k_v - 1$;
end
$d_t = \delta_t(d_t)$;
for each task $v \in T \setminus \{t\}$ do
  if $d_t < d_v$ then $k_v = k_v + 1$;
  if $d_v < d_t$ then $k_t = k_t + 1$;
end

5.3 Optimizing vector clocks for embedded environments

We now discuss the complexity of the algorithms of this execution model. A direct implementation would certainly not perform well, as every activation termination costs $O(3T^2)$. To reduce the cost of this operation, we can first reduce the set of tasks $v$ to which the vector clock of the ended task $t$ is compared. The transitivity of the relation $\sqsubseteq$ allows us to update $t$ only using its direct neighbors $N$ in the dependency graph. This reduces the complexity to $O(3T \times N)$, knowing that in usual graphs the average value of $N$ is 4. The second reduction is based on the fact that, in our specific context, we know which tasks own the activations being compared. This allows us to take advantage of the vector clock property extracted from the LVT construction in subsection 5.1: $e^i_t \to e^j_v \Leftrightarrow h(e^i_t)[t] \le h(e^j_v)[t]$. Thus deciding $d_t < d_v$ is a $O(1)$ operation, reducing the complexity of the end of activation update to $O(3N)$.
Another issue with vector clocks is their size in coordinates, the memory they use and the cost of incrementing such vectors. Many works aim at reducing the vector dimension [21, 13, 26, 27], but finding the smallest dimension is an NP-complete problem [28]; it might give no reduction of the initial vector size [25], and the existing methods are in general rather costly and, more importantly, remove the possibility to compare clocks in $O(1)$. In our case, the vector dimension can be significantly reduced by cutting the dependency graph into subgraphs before computing the time-stamps. This gives multiple vector clock posets, with incomparable elements across different posets (with sufficiently small subgraphs, we can achieve any vector dimension). The synchronization between activations of tasks across different posets can then be made through different methods, or using a hierarchical LVT. Actually, for most of the targets we studied, the chip consists of clusters interconnected over a Network on Chip (NoC), which forces the division of the global dependency graph anyway, and allows synchronization using a simple network protocol.
Finally, we have to settle the need for infinite vector coordinates. We could use 64-bit integers to encode them, which would functionally prevent an overflow, but this is not satisfactory and ruins the effort to reduce memory usage. Instead, we increment and compare the coordinates using a global modulo $M$, computed off-line such that $M > 2T'$, where $T'$ is the maximum advance a producing task can have on its consumers. We simply associate to the integer set $\{0, \ldots, M-1\}$ the relation $\preceq$ with $x \preceq y \Leftrightarrow (x \le y \wedge y - x \le T') \vee (x > y \wedge M + y - x \le T')$. This generally results in coordinates encoded on 8 or 16 bits, allowing acceleration using specialized vector instructions on some targets. Figure 4 shows the complete dependency graph for the pipeline of Figure 2, with vector time-stamps on each activation, from which the runtime data can be extracted.
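A minimal C sketch of this wrapped coordinate comparison (the names are ours) follows; it decides $x \preceq y$ on coordinates stored modulo $M$, given the off-line bound $T'$ on the producer/consumer advance.

#include <stdbool.h>
#include <stdint.h>

/* x and y are clock coordinates stored modulo M (with M > 2*tmax).
 * Returns true iff x precedes or equals y in the wrapped order. */
static bool coord_leq(uint8_t x, uint8_t y, uint8_t M, uint8_t tmax)
{
    if (x <= y)
        return (uint8_t)(y - x) <= tmax;      /* no wrap-around   */
    else
        return (uint8_t)(M + y - x) <= tmax;  /* y wrapped past M */
}

For instance, with the modulo $M = 8$ of Figure 4 and a maximum advance of 3, coord_leq(7, 1, 8, 3) returns true: coordinate 1 is two steps ahead of coordinate 7 once the wrap-around is taken into account.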

6. KERNEL IMPLEMENTATION

We prototyped an implementation of this execution model, as a micro-kernel for two target platforms: 1) a non-public domain embedded many-core architecture; 2) an x86 multi-core PC.

6.1 Asymmetric kernel

The processor cores of the target are divided in two categories: 1) the Control Core (CC), which is in charge of the main part of task scheduling, and supervises the other cores; 2) the Processing Cores (PC), which are in charge of user computation, inter-process communication, and a minor part of task scheduling. There is always one CC, and at least one PC. This repartition takes advantage of the parallelism of the target architectures to optimize the scheduling operations, by minimizing the scheduling overhead on the PCs (they only perform small operations that do not imply race conditions between the different PCs). Using wait-free algorithms [14], the CC can perform the remainder of the scheduling operations while the PCs keep running ready tasks (if any are left). The only explicit synchronization occurs when tasks are inserted in or removed from a lock-free FIFO, and possibly when a core wakes an idle one, on targets where power saving is important. The number of PCs must be adequate to avoid making the CC a bottleneck. This is not a problem when partitioning the available cores into logical or physical computation clusters is considered (cf. subsection 1.2).
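As an illustration of the CC/PC split, here is a minimal sketch (ours, using C11 atomics and the task_rt type of the earlier sketch) of a bounded ready-task ring where the single CC pushes and the PCs pop. A real kernel would need target-specific memory ordering, idle-core wake-up, and a guarantee (e.g. from the off-line buffer sizing) that the number of simultaneously ready tasks stays below RING_SIZE; all of this is omitted here.

#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 64    /* power of two, bounds simultaneously ready tasks */

struct ready_ring {
    _Atomic size_t head;    /* next slot the PCs will claim */
    _Atomic size_t tail;    /* next slot the CC fills       */
    _Atomic(struct task_rt *) slot[RING_SIZE];
};

/* CC side: single producer, publishes a task that became ready. */
static void cc_push(struct ready_ring *r, struct task_rt *t)
{
    size_t tail = atomic_load(&r->tail);
    atomic_store(&r->slot[tail % RING_SIZE], t);
    atomic_store(&r->tail, tail + 1);   /* publish after filling the slot */
}

/* PC side: several consumers race to claim the next ready task;
 * returns NULL when no task is ready (the PC then goes idle). */
static struct task_rt *pc_pop(struct ready_ring *r)
{
    for (;;) {
        size_t head = atomic_load(&r->head);
        if (head == atomic_load(&r->tail))
            return NULL;                           /* ring is empty */
        if (atomic_compare_exchange_weak(&r->head, &head, head + 1))
            return atomic_load(&r->slot[head % RING_SIZE]);
    }
}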

6.2 Cyclic buffers for channel implementation

The communication layer is entirely handled on the PCs. Each channel of the communication graph is implemented as an exchange buffer in the shared memory. Though mapping the channels to cyclic buffers with first-level pointer equivalence has been an important part of our prototyping, it is only a detail in this paper, which focuses on the execution model. We only describe here the operations performed by the IPC mechanisms of our micro-kernel that are necessary to understand the results of the evaluation section. The micro-kernel provides each task with production access (resp. intake access) to the exchange buffers, which are statically allocated. For each buffer it accesses, a task has an access descriptor. It describes the cyclic access behavior of the task in the memory region, and the current status in this cyclic behavior (including the current alias pointer into the buffer). These access aliases allow an easy implementation of the complex access schemes described in the communication graph. At the end of every activation of a task, the access descriptors of the buffers used by the ended activation and by the next activation are updated. This update is a mere modulo incrementation of the alias pointer. In some rare cases, a shadow copy is performed to enforce pointer equivalence in user code. Blocking read operations are not handled at the buffer level: the scheduling used to build the dependencies between tasks (thus the LVT ruling the execution of the tasks) guarantees that there is no race condition in buffer accesses.
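The access descriptor could be sketched as below (a hypothetical layout of ours; the paper does not give the kernel's actual structure), the end-of-activation update being the modulo incrementation of the alias pointer mentioned above.

#include <stddef.h>

/* Per-task, per-buffer access descriptor. */
struct access_desc {
    char   *base;        /* start of the statically allocated buffer */
    size_t  size;        /* buffer size, in bytes                    */
    size_t  cycle_len;   /* phases in the task's cyclic behavior     */
    size_t  phase;       /* current phase                            */
    const size_t *step;  /* bytes to advance at each phase           */
    char   *alias;       /* current alias pointer into the buffer    */
};

/* End-of-activation update: modulo incrementation of the alias. */
static void advance_access(struct access_desc *a)
{
    size_t off = (size_t)(a->alias - a->base);
    off = (off + a->step[a->phase]) % a->size;
    a->alias = a->base + off;
    a->phase = (a->phase + 1) % a->cycle_len;
}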

6.3 Task management

The PCs are in charge of executing user code, and of providing services for inter-task communication and task activation supervision. The CC is in charge of hardware supervision and global system state updates.



Figure 4: Two replications of the dependency graph shown in Figure 3, with the vector time-stamps over each activation; their coordinates are encoded modulo 8. In this example, the initial clock values are those of activations $e^4_t$, $e^3_u$ and $e^0_v$, and the lengths of the incrementation cycles are respectively 4, 6, and 3.

Initially, a PC starts in the micro-kernel context and enters an idle state, where it waits for at least one user task to be ready (no dependencies left for the current activation) and eligible (available for a PC to load its context). When one or more tasks become ready and eligible, all idle PCs are woken up by the CC and try to load a ready user context. If a PC succeeds, it loads the user context for execution and changes the task's state to running (otherwise it returns to the idle state). When a user context is loaded, the PC restores the processor to protected mode and returns to the last position of the task in user code (or to the task entry point at first load). In user context, a task can access the communication interface through local pointers to buffers, retrieved using communication system calls (the calls are inlined and execute in protected mode, as they consist only in reading the application runtime data). At the end of an activation, the task must enter the end of transition procedure through an appropriate system call. The following operations are then performed: the processor enters privileged mode; the next availability date of the task is computed by incrementing the current availability date; for each access descriptor, the exchange position is updated for the next activation; the PC switches to the micro-kernel context, saves the execution context of the task and updates its status to ended activation; finally the PC performs the same operations as when woken up from idle mode.
The CC never executes user code. It is the first execution resource to start, and is in charge of waking all the PCs, which enter idle mode and wait for kick-off. It must also initialize any other hardware resource (e.g. the network interface to the NoC, if any). Once the hardware setup is over, the CC wakes all the PCs so they start to feed on the ready task set. The rest of its execution flow is an infinite loop, waiting for service requests from hardware resources (including the PCs). When a PC requests the end of task treatment for the task it finished, the CC performs the LVT comparisons, updates the dependency counters accordingly, and changes the state to ready for every task whose dependency counter reached zero.
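Combining the earlier sketches, the end-of-activation treatment could look like the following (again a sketch built on our hypothetical helpers coord_leq, apply_increment and cc_push, not the kernel's actual code). For simplicity the whole of Algorithm 2 is shown in one function, whereas the prototype splits these steps between the finishing PC and the CC; the symmetric update of the finished task's own counter, which needs each neighbor's owner coordinate, is also omitted.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define M    8   /* global modulo of subsection 5.3 (as in Figure 4) */
#define TMAX 3   /* illustrative bound on producer/consumer advance  */

/* O(1) test "d_a < d_b", using only the owner coordinate of a and
 * the wrapped order of subsection 5.3. */
static bool clock_lt(const struct task_rt *a, int a_id,
                     const struct task_rt *b)
{
    uint8_t x = (uint8_t)a->d[a_id], y = (uint8_t)b->d[a_id];
    return x != y && coord_leq(x, y, M, TMAX);
}

/* End of activation of task t: Algorithm 2 restricted to t's
 * direct neighbors in the dependency graph (cf. subsection 5.3). */
static void end_of_activation(struct ready_ring *ring,
                              struct task_rt *t, int t_id,
                              struct task_rt *nbr[], size_t n)
{
    for (size_t i = 0; i < n; i++)      /* first pass of Algorithm 2 */
        if (clock_lt(t, t_id, nbr[i]))
            nbr[i]->k--;
    apply_increment(t);                 /* d_t = delta_t(d_t)        */
    for (size_t i = 0; i < n; i++) {    /* second pass               */
        if (clock_lt(t, t_id, nbr[i]))
            nbr[i]->k++;
        if (nbr[i]->k == 0)
            cc_push(ring, nbr[i]);      /* neighbor became ready     */
    }
}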

7. EVALUATION AND BENCHMARKS

This section gives some benchmarks to evaluate the system overhead for a stream application executed over our LVT execution support. The measures are only given for the implementation on an x86 multi-core workstation; another implementation targeting a non-public domain embedded many-core architecture has also been evaluated, but we cannot publish the results. The x86 implementation has been evaluated on a workstation based on an Intel Xeon 2.53 GHz with 8 hyper-threaded cores (dispatched as 1 CC and 7 PCs), with a shared L3 cache of 12 MB. The micro-kernel was compiled using gcc 4.6 for x86 targets, with optimization flag -O3. The execution times were extracted by reading the performance monitoring devices provided by the respective targets; on x86, the measures are based on the cycle-accurate time-stamp counter. The results we present are average values, based on execution times collected during the execution of ten thousand replications of the dependency graph of each application.
The applications used were written in or translated to the ΣC language (cf. section 4), from which we could generate the runtime data and compile the application binaries for the micro-kernel:

1. FFT2 is a translation of the StreaMIT benchmark implementation of a Fast Fourier Transform; this application was chosen to allow a comparison with the static scheduling approach for RAW;
2. BeamFormer is a translation of the StreaMIT benchmark implementation of a beam former, chosen for the same reasons as FFT2;
3. Laplacian is an application that performs edge detection on a flow of input images, by convolving lines and columns with Gaussian kernels; it was chosen because it is highly parallel and well adapted to our targets;
4. MotionDetection is an application that performs target tracking on a flow of related input images (image differentiation, thresholding, connected component extraction and fusion, as well as temporal target matching); it was chosen because its task behavior is dynamic (due to the randomness in the effective number of targets) and thus exposes the advantages of our dynamic execution model.

Table 1 gives the main characteristics of the applications: number of tasks, size of the clock vectors in coordinates, and average execution time of one task activation.


Table 1: Properties of the applications used for measurement (x86 target)

Application     Tasks   LVT size (coords)   Avg. activation (cycles)   Avg. activation (µs)
FFT2            15      16                  767                        0.307
BeamFormer      53      56                  2019                       0.808
Laplacian       16      16                  6970                       2.788
MotionDetect.   47      48                  223777                     89.510

7.1 System execution time

Table 2: Measurement of system operations (x86 target)

Application     Operation         Cycles   Time (µs)
FFT2            Scheduling (CC)   704      0.281
                Scheduling (PC)   684      0.273
                IPC (PC)          167      0.067
BeamFormer      Scheduling (CC)   691      0.277
                Scheduling (PC)   811      0.324
                IPC (PC)          195      0.078
Laplacian       Scheduling (CC)   790      0.316
                Scheduling (PC)   781      0.294
                IPC (PC)          253      0.101
MotionDetect.   Scheduling (CC)   720      0.288
                Scheduling (PC)   803      0.321
                IPC (PC)          229      0.091

Table 2 gives measurements of the execution time of the system operations (cf. subsection 6.3 for a detailed description). The entry Scheduling (CC) gives the average execution time of the LVT update and task list management on the CC. The entry Scheduling (PC) gives the average execution time of the LVT update, task election and context switch on the PC. The entry IPC (PC) gives the average execution time of the buffer access descriptor updates on the PC. Table 3 gives the overhead of the micro-kernel for each application, obtained by dividing the total time spent in the micro-kernel on the PCs by the total time spent in user computation.
First of all, we can observe that though the applications have different granularities (from 15 to 53 tasks), the average time spent in system operations is almost the same. This is possible thanks to the optimizations we implemented for the comparison and incrementation of the clock vectors (cf. subsection 5.3), and it demonstrates the scalability of our approach and the good properties of vector time for dynamic scheduling. Though the target architectures and scheduling approaches are quite different, we compare our results to those of the static scheduling for StreaMIT on the RAW architecture. For the StreaMIT benchmarks, the micro-kernel overhead is very high, and for these two specific examples the static scheduling is far better than our approach. This is due to the very short average execution time of the tasks in these applications. On the other hand, when executing applications whose tasks perform complex and rather long operations, such as our motion detector, the micro-kernel incurs a very low overhead, and the application can take full advantage of our dynamic scheduling. This demonstrates the importance of task granularity, which can be controlled by the compilation tools by merging several parallel tasks in order to reach an appropriate minimum execution time (as in [11]).

Table 3: Measurement of system overhead (x86 target)

Application     System overhead
FFT2            113.80%
BeamFormer      49.96%
Laplacian       14.17%
MotionDetect.   0.45%

7.2 Memory footprint

The measures of memory usage are given for the x86 binaries. Figure 5 shows the memory used by the runtime data generated for the tasks, which consist essentially of the vector clocks, the vector clock increments and the access descriptors. It also gives the memory used by the micro-kernel, whose size does not depend on the application, and the total memory overhead. The maximum memory overhead we measured is 64.4 KB (48.7 KB on average), which represents an average memory overhead of about 2% of the on-chip shared memory of recent embedded multi-core designs (ranging from 2 to 4 MB). On the x86 platform used for the evaluation, it represents 0.4% of the shared L3 cache of 12 MB.
Figure 6 shows the detailed memory usage of the x86 micro-kernel: the total memory usage is given, along with the memory usage per module of the micro-kernel. Compared to the library approach, the micro-kernel uses much less memory. When supporting stream programming with a library over an off-the-shelf OS, one must pay the memory cost of both the library and the OS. An embedded Linux kernel alone, stripped of all unnecessary drivers, uses 197 KB; with our micro-kernel approach, we use 67.31% less, taking the runtime data into account.

Figure 5: Memory used by runtime data, and by micro-kernel in the x86 test binaries; the average memory overhead is 2% for recent embedded multicore designs with 2MB of shared memory, which is at least 67.31% less than using a stream programming support library over a minimal Linux kernel.




Figure 6: Detail of memory used by micro-kernel modules in the x86 test binaries; total size is 28.9 KB, which is 85.33% less than a minimal embedded Linux kernel.

8. CONCLUSION AND FUTURE WORKS

This paper showed that building a specialized micro-kernel provides an efficient execution support for the currently emerging generation of stream programming languages, with very low latencies on a first selection of mainstream applications. Our prototype will be improved with additional code optimizations and more subtle task list management (efficient scheduling policies would significantly improve performance and power saving). With very fine-grained applications, the control core may become a bottleneck (which is unlikely, as the latencies are very tight) and the micro-kernel overhead becomes too high. This is not a problem for most real applications targeting embedded multi-cores (which have sufficiently high execution times per task), and it would anyway be quite easy for the compilation tools to avoid this situation (e.g. by fusing several parallel tasks to reach an appropriate minimum execution time [11]). Apart from code optimizations, our future work will be to extend the micro-kernel to non-shared memory systems, with efficient communication between physical or logical clusters. This raises several issues, such as core and task migration, which will be interesting to address.

9. REFERENCES

[1] A. Amar, P. Boulet, and P. Dumont. Projection of the Array-OL specification language onto the Kahn Process Network computation model. Technical Report RR-5515, LIFL, USTL, Mar 2005.
[2] S. Amarasinghe, M. I. Gordon, M. Karczmarek, J. Lin, D. Maze, R. M. Rabbah, and W. Thies. Language and compiler design for streaming applications. International Journal of Parallel Programming, 33(2/3), Jun 2005.
[3] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph., 23:777–786, Aug 2004.
[4] J. T. Buck and E. A. Lee. Scheduling dynamic dataflow graphs with bounded memory using the token flow model. Technical report, 1993.
[5] L. Cudennec and R. Sirdey. Parallelism reduction based on pattern substitution in dataflow oriented programming languages. In Proceedings of the 12th International Conference on Computational Science, ICCS'12, Omaha, Nebraska, USA, 2012. To appear.
[6] B. A. Davey and H. A. Priestley. Introduction to Lattices and Order. Cambridge University Press, New York, NY, 2002.
[7] P. Dumont and P. Boulet. Another multidimensional synchronous dataflow: Simulating Array-OL in Ptolemy II. Technical Report RR-5516, LIFL, USTL, Mar 2005.
[8] C. J. Fidge. Partial orders for parallel debugging. In Proceedings of the 1988 ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, pages 183–194, 1988.
[9] C. J. Fidge. Timestamps in message-passing systems that preserve the partial ordering. In Proceedings of the 11th Australian Computer Science Conference, pages 56–66, 1988.
[10] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete. Cyclo-static dataflow. IEEE Transactions on Signal Processing, 44(2):397–408, 1996.
[11] M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. SIGOPS Oper. Syst. Rev., 36(5):291–303, Oct 2002.
[12] T. Goubier, R. Sirdey, S. Louise, and V. David. ΣC: A programming model and language for embedded manycores. In Y. Xiang, A. Cuzzocrea, M. Hobbs, and W. Zhou, editors, ICA3PP (1), volume 7016 of Lecture Notes in Computer Science, pages 385–394. Springer, 2011.
[13] M. Habib, M. Huchard, and L. Nourine. Embedding partially ordered sets into chain-products. In Proceedings of the 1995 International Symposium on Knowledge Retrieval, Use, and Storage for Efficiency, Santa Cruz, CA, 1995. University of California.
[14] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(2):124–149, 1991.
[15] G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, Information Processing, pages 471–475, Stockholm, Sweden, Aug 1974. North Holland, Amsterdam.
[16] A. D. Kshemkalyani and M. Singhal. Distributed Computing: Principles, Algorithms and Systems. Cambridge University Press, 2011.
[17] E. A. Lee. Finite state machines and modal models in Ptolemy II. Technical Report UCB/EECS-2009-151, EECS Department, University of California, Berkeley, Nov 2009.
[18] E. A. Lee and T. Parks. Dataflow process networks. In Proceedings of the IEEE, pages 773–799, 1995.
[19] F. Mattern. Virtual time and global states of distributed systems. In Proceedings of the International Workshop on Parallel and Distributed Algorithms, pages 215–226, 1988.
[20] P. Oliveira Castro, S. Louise, and D. Barthou. Reducing memory requirements of stream programs by graph transformations. In Proc. of the Int. Conf. on High Performance Computing and Simulation (HPCS), pages 171–180, 2010.

[21] Ø. Ore. Theory of Graphs, volume 38 of American Mathematical Society Colloquium Publications. Providence, RI, 1962.
[22] D. Parker, G. J. Popek, G. Rudisin, A. Stoughton, B. J. Walker, E. Walton, J. M. Chow, D. Edwards, S. Kiser, and C. Kline. Detection of mutual inconsistency in distributed systems. IEEE Trans. Software Engineering, 9:240–247, 1983.
[23] M. Singhal. A heuristically-aided algorithm for mutual exclusion in distributed systems. IEEE Trans. on Computers, 38(5), 1989.
[24] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In R. Horspool, editor, Compiler Construction, volume 2304 of Lecture Notes in Computer Science, pages 49–84. Springer Berlin / Heidelberg, 2002.
[25] W. T. Trotter. Combinatorics and Partially Ordered Sets: Dimension Theory. Johns Hopkins University Press, Baltimore, MD, 1992.
[26] P. A. S. Ward. An offline algorithm for dimension-bound analysis. In Proceedings of the 1999 International Conference on Parallel Processing. IEEE Computer Society, 1999.
[27] J. Yañez and J. Montero. A poset dimension algorithm. Journal of Algorithms, 30(1):185–208, 1999.
[28] M. Yannakakis. The complexity of the partial order dimension problem. SIAM Journal on Algebraic and Discrete Methods, 3(3):351–358, 1982.
[29] X. D. Zhang, Q. J. Li, R. Rabbah, and S. Amarasinghe. A lightweight streaming layer for multicore execution. In Workshop on Design, Architecture and Simulation of Chip Multi-Processors, Chicago, IL, Dec 2007.