Collision Detection on Multicore Architecture: Addressing the Load Balancing Issue

Jérémie Le Garrec, Jean-Thomas Acquaviva, Xavier Merlhiot
CEA-LIST, 92265 Fontenay-aux-Roses, France
contact: {jeremie.le-garrec, jean-thomas.acquaviva, xavier.merlhiot}@cea.fr

Abstract

Multicore architectures are already omnipresent, yet surprisingly little research has been devoted to their impact on collision detection algorithms. Most traditional implementations of collision detection are oriented toward sequential performance, but they are now strongly challenged by multicore architectures, which require parallelism. In this paper, we detail a novel approach to extract parallelism from collision detection based on bounding volume hierarchies and to cope with load balancing issues. The collision detection is evaluated with an original two-stage scheme. The first stage is a weight-driven tree exploration, aiming at detecting independent tasks of equivalent computational cost. Thanks to carefully designed weight metrics, tasks are generated in sufficient quantity to feed all CPU cores, but not in such numbers that synchronization overhead becomes significant. Afterward, during the second stage, the independent tasks are executed in parallel. We detail three different weight metrics and discuss their results thoroughly: the first metric is based only on temporal coherency, the second purely on geometrical considerations, and the last and best-performing one is hybrid. Overall, we present substantial performance gains, achieving a speed-up of 6.9 on 8 cores for a real industrial case.

Keywords: Collision Detection, Multicore, Parallel, Load Balancing, Virtual Prototyping, Bounding Volume Hierarchy

1 Introduction

To cope with near real-time constraints, acceleration structures based on Bounding Volume Hierarchies (BVH trees) are commonly used in collision detection (CD) algorithms and are widely accepted as a convenient and efficient optimization. Consequently, the problem of CD for interactive simulation boils down to the traversal of huge trees with dynamic branch pruning under soft real-time constraints. To accelerate this computation on multicore architectures, it is necessary to exploit the parallelism present in a complex geometrical scene.

Figure 1: Two simple objects in collision (a) and a radial representation of the resulting CD tree traversal (b). The tree is traversed according to an original algorithm, in two different directions: first, the nodes in blue are explored in a weight-driven way to extract parallelism; afterward, the nodes in red are explored depth first and in parallel by multiple threads.

1.1 Collision Detection: a complex tree traversal problem

A CD query in a complex scene is commonly divided into a broad phase and a narrow phase. The broad phase identifies smaller groups of objects that may be colliding, while the narrow phase performs the pairwise tests within these subgroups. We put a particular focus on the narrow phase, which is more complex and accounts for most of the overall execution time. The narrow phase is often based on BVHs: at the price of a limited memory footprint extension (a factor of 2), this kind of hierarchy decreases the overall complexity of a CD query between two objects from O(n²) to O(n log n), where n is the number of geometrical primitives. Once objects are represented as binary trees, the CD query between two objects can be seen as the Cartesian product of two binary trees, i.e. a quad-tree (referred to as the CD tree). Hierarchical methods have the advantage of using simple tests to determine whether a given node in the hierarchy should be pruned from the search. To limit the memory footprint, CD trees are always examined depth first. This allows the complete tree traversal to be performed with a stack which remains smaller than the tree height, i.e. log(n). This optimization introduces a complex behavior, because the tree traversal becomes unpredictable. Regarding memory layout, algorithms such as [Yoon et al. 2005] [van Emde-Boas et al. 1977] limit the impact of this unpredictability on cache performance. From a parallelization point of view, the unpredictability remains problematic because it leads to load imbalance: achieving a correct speed-up for a parallel tree descent with dynamic branch pruning has no convincing solution yet.
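To make the traversal pattern concrete, the following is a minimal C++ sketch of a depth-first CD query over the Cartesian product of two sphere BVHs, using an explicit stack that grows with the tree height only. The node layout and the overlap test are illustrative assumptions, not the implementation used in this paper.

#include <utility>
#include <vector>

// Hypothetical BVH node: a bounding sphere plus child indices (-1 for a leaf).
struct BVHNode {
    float cx, cy, cz, r;   // bounding sphere center and radius
    int   left, right;     // children indices, -1 if leaf
};

bool spheresOverlap(const BVHNode& a, const BVHNode& b) {
    float dx = a.cx - b.cx, dy = a.cy - b.cy, dz = a.cz - b.cz;
    float rs = a.r + b.r;
    return dx * dx + dy * dy + dz * dz <= rs * rs;
}

// Depth-first traversal of the implicit CD tree (Cartesian product of two BVHs).
// The explicit stack stays proportional to the tree height, not to the number of
// leaves, which is the memory argument made in the text above.
void collide(const std::vector<BVHNode>& A, const std::vector<BVHNode>& B,
             std::vector<std::pair<int, int>>& leafPairs) {
    std::vector<std::pair<int, int>> stack{{0, 0}};   // start from both roots
    while (!stack.empty()) {
        auto [ia, ib] = stack.back();
        stack.pop_back();
        const BVHNode& a = A[ia];
        const BVHNode& b = B[ib];
        if (!spheresOverlap(a, b)) continue;          // dynamic branch pruning
        if (a.left < 0 && b.left < 0) {               // both leaves: narrow-phase candidates
            leafPairs.emplace_back(ia, ib);
            continue;
        }
        // Descend into the larger node first (a simple heuristic).
        if (b.left < 0 || (a.left >= 0 && a.r >= b.r)) {
            stack.push_back({a.left, ib});
            stack.push_back({a.right, ib});
        } else {
            stack.push_back({ia, b.left});
            stack.push_back({ia, b.right});
        }
    }
}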

1.2 Parallelism Exploitation and Extraction

To accelerate computation on multicore architectures, it is necessary to exploit the parallelism already present in a complex geometrical scene. For instance, numerous objects may potentially be colliding against each other. Since each potentially colliding pair of objects leads to a CD task, such tasks can be executed in parallel. We refer to the parallel execution of independent tasks as parallelism exploitation.

// Building a list of independent tasks
for (i = 0; i < NB_objects_in_scene; i++)
    for (j = i + 1; j < NB_objects_in_scene; j++)
        task = new cdTask(object i, object j)
        runTimeTaskList.push(task)

# Begin parallel section
for each task in runTimeTaskList
    task->detectCollision()
# End parallel section

// Sequential reduction
for each task in runTimeTaskList
    task->gatherDetectedContactPoint(global_results)

Figure 2: Pseudo code for a CD algorithm exploiting parallelism.

After task execution, a synchronization phase, which is basically a reduction, is necessary to gather the contact points from every task (see the pseudo-code in figure 2). During our experiments we observed that the cost of this reduction is negligible in terms of computational time. While being mandatory, this approach is not satisfying because of the heterogeneity of object complexity in a scene. A single pair of objects can easily concentrate most of the computational time of collision detection. This is typical of industrial cases where the user is displacing a single object in the scene. At worst, a scene can be composed of only two very large objects, leading to a single CD query. In such cases, the performance bottleneck is the absence of parallelism to exploit. Therefore, in this paper we present a novel approach to extract parallelism from a single task. This method takes synchronization costs and task management overhead into account: relying on carefully designed metrics to estimate task weight, it extracts as much parallelism as needed, but no more. The paper is organized as follows: section 2 proposes an overview of related work. Section 3 describes the core of our method, named task calibration. Section 4 deals with the application of task calibration to the particular case of CD trees. Sections 5, 6 and 7 propose three weight metrics. Experimental results are discussed in section 8. Finally, we conclude and provide opening remarks in section 9.
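As a concrete illustration of the parallel section in figure 2, the sketch below uses a plain OpenMP loop; the cdTask interface shown is a simplified assumption (the implementation discussed in section 8.2 relies on the Intel TaskQ extension instead).

#include <omp.h>
#include <vector>

// Simplified stand-in for a CD task; names and bodies are assumptions.
struct cdTask {
    void detectCollision() { /* BVH pair traversal would go here */ }
    void gatherDetectedContactPoints() { /* append local contacts to a global result */ }
};

void runTimeStep(std::vector<cdTask*>& runTimeTaskList) {
    // Parallel section: independent tasks are executed concurrently.
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < static_cast<int>(runTimeTaskList.size()); ++t)
        runTimeTaskList[t]->detectCollision();

    // Sequential reduction: gather contact points from every task.
    for (cdTask* task : runTimeTaskList)
        task->gatherDetectedContactPoints();
}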

2 Related Work

The determination of contacts between rigid or deformable bodies is a shared issue among several research areas and technical applications, including computational mechanics, computer graphics and real-time mechanical simulation (for computer games, virtual reality, haptics and robotics). Given the immense literature on the subject, for a more complete panorama we refer the reader to several good surveys [Lin and Gottschalk 1998] [Teschner et al. 2005] [Klein 2005] and to the book [Ericson 2004], which also details many important implementation issues. More specifically about BVHs, an important design choice is the kind of bounding volume: spheres [Hubbard 1996], OBBs [Gottschalk et al. 1996], DOPs [Klosowski et al. 1998] [Zachmann 1998], Boxtrees [Agarwal et al. 2001], AABBs [van den Bergen 1997] and convex hulls [Ehmann and Lin 2001]. BVHs are also suitable structures for time-critical CD, since the traversal can easily be interrupted when the time budget is exhausted [Hubbard 1996]. This trade-off between accuracy and execution time does not fit our needs, because we want an exact CD result. An alternative to BVHs are space-subdivision approaches, for instance an octree [Kitamura et al. 1998] or a voxel grid [McNeely et al. 1999]; these non-hierarchical data structures are more efficient for collision detection of deformable objects. Another kind of approach uses graphics hardware [Myszkowski et al. 1995] [Wong 2004]. Most CD queries belong to one of the following categories: interference tests, proximity tests, or LMD (Local Minimal Distance) queries [Johnson and Cohen 1998]. Our framework, while mostly centered around quasi-LMD (an improvement of the LMD approach [Merlhiot 2007]), could easily be extended to extract parallelism from interference and proximity tests. Ray tracing algorithms share the same acceleration structures as CD queries, and they also deal with the traversal of complex trees. A substantial literature is dedicated to their numerous parallel implementations [Shevtsov et al. 2007] [Hurley 2005]. However, the very large number of rays cast in any realistic scene leads to a number of independent tasks exceeding by far the number of CPU cores available in most systems; therefore, ray tracing algorithms do not have to perform parallelism extraction. Parallelism and collision detection appear jointly within multi-frequency algorithms [Johnson et al. 2005], where parts of the computation (often a spatial region where collisions are occurring) are managed by different threads. Interestingly enough, these methods exploit temporal coherency through region growing or other schemes. While extremely appealing, they present two major differences with our work: firstly, they do not address load balancing at all, as regions handled by high-frequency threads can have significantly different computational weights; secondly, they break the exactness requirement, since they allow contact points to be missed due to synchronization issues with the low-frequency task. Our industrial constraints prevent us from bearing the risk of missing any contact point. The idea of traversing the tree in two different ways is also present in the work of [Larsson and Akenine-Möller 2001], who propose to traverse the tree in different directions depending on the depth of the node. They obtain significant results on work mostly oriented toward deformable objects, but the two traversal methods remain simple (bottom-up vs top-down). [Klein and Zachmann 2005] have addressed the problem of weight estimation for any node present in a CD tree; their work is a major source of inspiration for our a priori weight metric discussed in section 6. Among the research specifically devoted to the parallelization of CD, we found only a handful of references: [Figueiredo and Fernando 2004] and [Grinberg and Wiseman 2007] propose methods to extract parallelism, and both address load balancing. However, in both cases load balancing is statically preprocessed, either by a layered tree splitting or by a more sophisticated vector space approach. While very interesting, these contributions do not tackle the full complexity of the problem and fail to capture the dynamic behavior of CD on very unbalanced trees.

3 Overview of Task Calibration

The main goal of our method is to generate enough tasks to feed all available CPU cores. Nevertheless, we should not generate too many tasks, in order to cap synchronization overhead. Ideally, all generated tasks should be homogeneous in terms of computational cost; we then say the tasks are of the same caliber.

Definition of a task: In this paper, a task corresponds to an independent part of the computation. It can concern CD between different objects or an independent fragment of a CD query. A task is not tied to an implementation: it can be executed serially, through a process (MPI), or with a thread (OpenMP or POSIX).

During a time step, the initial set-up for the scene is a set of tasks corresponding to all the CD queries between objects. Except for very specific and symmetrical cases, the computational times of two different CD queries may differ by several orders of magnitude. To split heavy tasks and merge light ones, we introduce a weight estimation for every task. This estimation is directly correlated to the CPU time needed to perform the task evaluation. The two main requirements for the weight estimation are:

• External weighting: being able to estimate the total weight of any given task. This is necessary for merging several tasks.
• Internal weighting: being able to estimate the weight of any part of a given task. This is required to split a task into a set of sub-tasks.

We present a general approach to calibrate tasks assuming a weight metric in the remainder of this section. Next we address the specific case of CD hierarchies in section 4, and we propose three methods to estimate a task weight in sections 5, 6 and 7.

3.1 Tasks calibration

Figure 3: At the beginning of the simulation the initTaskList is built from the geometrical scene. Every time step the initTaskList is processed and the runTimeTaskList is regenerated. Blue tasks are evaluated as too light to deserve an individual execution, hence they are merged together. The green task is of ideal weight, and the heavy red task is split into two sub-tasks.

During a preprocessing stage, the initial tasks determined by the geometrical conditions are stored into a list named initTaskList. Each time step, the tasks actually executed are built from the initTaskList and pushed into a list called runTimeTaskList (figure 3). All the difficulty of this process lies in moving tasks smartly from the initTaskList to the runTimeTaskList. Three actions are possible. The first one is to coalesce several small tasks into a single one. At the opposite, tasks representing a large execution time have to be split into multiple sub-tasks. At last, if a task fits our needs, it is simply pushed from the initTaskList to the runTimeTaskList. In this way, the runTimeTaskList is made of tasks of homogeneous execution time.

3.2 General merging and splitting algorithm

For every task i, one assumes that a weight ωi is available. For the purpose of our algorithm we introduce:

• Ω, a threshold on task weight.
• ε, a tolerance factor on Ω.
• Cart, a list to store tasks of small weight.

In order to split or coalesce tasks into tasks of equivalent weight, the threshold Ω is defined as:

Ω = (Σi ωi) / (β · nc)

where β is an overthreading factor, nc is the number of cores, and i iterates over all the tasks of the initTaskList. The overthreading factor is useful to generate slightly more tasks than available CPU cores: for robustness, it is better to pay a little more synchronization cost than to possibly starve some CPU cores. So, for each task i of the initTaskList:

• if Ω − ε ≤ ωi ≤ Ω + ε, the task i is copied directly into the runTimeTaskList.
• if ωi > Ω + ε, the task i is split into several sub-tasks. However, the generated sub-tasks may be of heterogeneous weight; consequently, each sub-task is also considered as a candidate for splitting or coalescing.
• if ωi < Ω − ε, the light task i is stored into the Cart list.

When the whole initTaskList has been processed, the Cart potentially contains a large number of small tasks. The Cart is sorted according to task weight. Until the end of the Cart is reached, the heaviest remaining task is merged with enough of the lightest tasks to reach Ω, and the resulting task is inserted into the runTimeTaskList. Experimentally, this merging heuristic produces tasks of similar weight, and it is much cheaper than a simplex method, which would be too costly under our soft real-time constraints. Overall, this simple mechanism produces a list of tasks of homogeneous weight which can be executed in parallel. The core of the method therefore lies in the way the weight of a task is estimated.
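The calibration pass described above can be summarized by the following sketch; the Task type, the splitInTwo() helper and the container choices are illustrative assumptions, not the code used in this work.

#include <algorithm>
#include <numeric>
#include <vector>

// Illustrative task type: 'weight' stands for the weight omega_i.
struct Task {
    double weight;
    std::vector<Task> splitInTwo() const {          // hypothetical splitting helper
        return { {weight / 2.0}, {weight / 2.0} };
    }
};

std::vector<Task> calibrate(std::vector<Task> initTaskList,
                            int nbCores, double beta, double epsilon) {
    double total = std::accumulate(initTaskList.begin(), initTaskList.end(), 0.0,
                                   [](double s, const Task& t) { return s + t.weight; });
    const double omega = total / (beta * nbCores);   // threshold Omega

    std::vector<Task> runTimeTaskList, cart;
    while (!initTaskList.empty()) {
        Task t = initTaskList.back();
        initTaskList.pop_back();
        if (t.weight > omega + epsilon) {
            // Too heavy: split, and reconsider each sub-task.
            for (Task& s : t.splitInTwo()) initTaskList.push_back(s);
        } else if (t.weight < omega - epsilon) {
            cart.push_back(t);                       // too light: park it in the Cart
        } else {
            runTimeTaskList.push_back(t);            // right caliber: keep as is
        }
    }

    // Merge: repeatedly top up the heaviest Cart task with the lightest ones.
    std::sort(cart.begin(), cart.end(),
              [](const Task& a, const Task& b) { return a.weight < b.weight; });
    while (!cart.empty()) {
        Task merged = cart.back();
        cart.pop_back();
        while (!cart.empty() && merged.weight < omega) {
            merged.weight += cart.front().weight;    // absorb the lightest remaining task
            cart.erase(cart.begin());
        }
        runTimeTaskList.push_back(merged);
    }
    return runTimeTaskList;
}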

4 Tasks calibration for CD trees

4.1 General tree traversal scheme

Given two BV hierarchies, most collision detection approaches consist in traversing the hierarchies simultaneously, either depth first, breadth first, or following an ad hoc strategy. The synchronized exploration of two binary BVH trees is equivalent to the traversal of a CD tree (figure 4). A task corresponds to one such traversal.

Figure 4: Principle of a CD query: a quad-tree resulting from the Cartesian product of two BVH trees. On the left, the BVH trees of objects A and B. On the right, the corresponding CD tree.

All nodes of a CD tree are constituted of a pair Di = (Ak, Bl), where Ak and Bl are bounding volumes (spheres in our case) from objects A and B respectively (for the sake of clarity, CD trees are represented as binary trees in the remainder of this paper). Within our quasi-LMD framework, determining whether Di may contain a contact between the geometrical primitives included within Ak and Bl is done by answering a set of tests: sphere intersection, normal cone orientation and primitive compatibility. The evaluation of these tests is referred to in the remainder of this paper as a node-node test. The obvious advantage of a depth-first traversal is the guarantee that the memory footprint is bounded by a stack of at most the height of the quad-tree. Using a breadth-first traversal for the whole tree is ineffective due to the huge number of nodes for large models, which leads to unsustainable memory pressure. However, from a load balancing perspective, candidate nodes for sub-task generation are mostly located near the top of the tree. That is why, to capture the parallelism, we introduce a tree traversal scheme which alternates between weight-driven and depth-first exploration.
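A minimal sketch of the simplest of these node-node tests is given below, assuming plain sphere bounding volumes; the Sphere type is an assumption, and the additional quasi-LMD tests (normal cone orientation, primitive compatibility) are not reproduced here.

// Minimal sketch of a node-node test between two bounding spheres.
struct Sphere {
    float cx, cy, cz;   // center
    float r;            // radius
};

// Returns true when the two bounding spheres overlap, i.e. when the
// corresponding CD-tree node may still contain a contact and its
// children must be explored.
bool nodeNodeTest(const Sphere& a, const Sphere& b) {
    const float dx = a.cx - b.cx;
    const float dy = a.cy - b.cy;
    const float dz = a.cz - b.cz;
    const float d2 = dx * dx + dy * dy + dz * dz;
    const float rsum = a.r + b.r;
    return d2 <= rsum * rsum;   // compare squared distances to avoid a sqrt
}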

4.2 Alternate tree traversal scheme

The alternate tree traversal is composed of two stages: the first one is a weight-driven partial traversal applied only to each task of the initTaskList; the second stage is a depth-first traversal and applies to all the generated tasks and sub-tasks (runTimeTaskList). During the first stage, nodes are examined in a weight-driven order. This is more sophisticated than a simple breadth-first traversal: it relies on a priority queue ordered by the internal weight ωi of a node Di. The decoupling of tree traversal and node evaluation is known in the literature as pre-order node visiting [Sedgewick 1997]. When ωi matches the value Ω, a sub-task is generated and Di is set as the sub-task's local root node. A pseudo-code is detailed in figure 5. As this weight-driven traversal is independent for each task, it can be done in parallel. This first stage drastically limits the memory consumption compared to a breadth-first traversal, and parallelism is extracted with only a limited number of nodes explored. Once all sub-tasks have been generated, the second stage corresponds to a depth-first traversal of their respective CD trees. Figure 6 summarizes this alternate traversal. Based on this scheme, we propose two different approaches for the weight estimation. One way to compute the weight is to rely on temporal coherency: the a posteriori method. Basically, this method uses the previous time step in order to estimate the execution time of the current time step. The second approach, called the a priori method, infers the weight using a metric based on geometrical properties.

5 A posteriori weight estimation

The basic idea of this method is to take temporal coherency into account, assuming small relative displacements between the objects. At every time step, all CD trees need to be traversed. Nevertheless, due to the physical nature of the simulated phenomena, there is a high probability that the CD results will only differ by a slight margin between two consecutive time steps. In a nutshell, the a posteriori method uses the contact points detected at the previous time step as temporal information. This information is used during the current time step to perform a balanced task calibration.

# Begin parallel section: for each task of initTaskList
while (priority_queue not empty) {
    my_nodePair = priority_queue.pop()
    if (my_nodePair.weight->meets(threshold) == true) {
        new_task = new cdTask()
        new_task->add(my_nodePair)
        runTimeTaskList->insert(new_task)
    } else {
        if (my_nodePair.weight > threshold) {
            foreach child in my_nodePair.children_list {
                child.weight = child->computeWeight()
                priority_queue.push(child)
            }
        } else {
            // push the node to the cart
            cart->add(my_nodePair)
        }
    }
}
cart->process()
# End parallel section

Figure 5: Pseudo-Code corresponding to the weight driven tree exploration.
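The weight-driven stage of figure 5 relies on a priority queue ordered by node weight; the following is a minimal C++ rendition of that loop (types, members and the computeWeight call are illustrative assumptions, not the paper's code).

#include <queue>
#include <utility>
#include <vector>

struct NodePair {                     // a CD-tree node D_i = (A_k, B_l)
    double weight;                    // internal weight of the node
    std::vector<NodePair> children;   // child node pairs
    double computeWeight() const { return weight; }   // stand-in for the real metric
};

struct LighterThan {                  // orders the priority queue by weight
    bool operator()(const NodePair& a, const NodePair& b) const {
        return a.weight < b.weight;   // heaviest node on top
    }
};

// One task of the initTaskList: explore until nodes reach the target caliber.
void weightDrivenStage(NodePair root, double threshold, double epsilon,
                       std::vector<NodePair>& runTimeTaskList,
                       std::vector<NodePair>& cart) {
    std::priority_queue<NodePair, std::vector<NodePair>, LighterThan> queue;
    queue.push(std::move(root));
    while (!queue.empty()) {
        NodePair node = queue.top();
        queue.pop();
        if (node.weight >= threshold - epsilon && node.weight <= threshold + epsilon) {
            runTimeTaskList.push_back(node);          // right caliber: sub-task local root
        } else if (node.weight > threshold) {
            for (NodePair& child : node.children) {   // too heavy: expand the node
                child.weight = child.computeWeight();
                queue.push(child);
            }
        } else {
            cart.push_back(node);                     // too light: parked in the Cart
        }
    }
}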

Figure 6: On the left, the weight driven tree traversal. Red nodes are selected as root node for extracted sub-tasks. On the right, two sub-tasks explore in parallel their respective CD tree using a depth-first traversal. Notice that nodes evaluated during the weight driven stage do not need to be explored again.

5.1 External weight

The estimation of the running time required to solve a CD query between two objects represented as BVH trees is known as the cost function [Gottschalk et al. 1996; Weghorst et al. 1984]: T = N · C, where N is the number of node-node tests and C is the time to process one test. We therefore choose N as our external weight metric; we observe experimentally a linear correlation between N and the execution time. At the beginning of a time step t, we compute the weight ωi(t) of task i as the sum of all the node-node tests performed at time step t − 1. Because a task from the initTaskList may have been split into several sub-tasks, we need to sum the weights of all its previously spawned sub-tasks:

ωi(t) = Σj Nj(t − 1)

where j ranges over the #subtasksi sub-tasks spawned by task i, and Nj is the number of node-node tests performed by sub-task j. The main interest of using the number of nodes explored instead of the real clock time is its reliability: CPU clock measurements in a multithreaded environment can be misleading, and are even meaningless with hardware support for very fine-grained multithreading such as hyperthreading.
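As a minimal illustration of this bookkeeping (names are assumptions), each sub-task only needs to count the node-node tests it performs, and the external weight of a task at step t is then the sum over the sub-tasks it spawned at step t − 1.

#include <vector>

// Hypothetical per-sub-task counter used as the a posteriori external weight.
struct CdSubTask {
    long nodeNodeTests = 0;        // N_j, incremented once per node-node test
};

struct CdTask {
    std::vector<CdSubTask> previousSubTasks;   // sub-tasks spawned at step t-1
    // omega_i(t) = sum over j of N_j(t-1)
    long externalWeight() const {
        long w = 0;
        for (const CdSubTask& s : previousSubTasks) w += s.nodeNodeTests;
        return w;
    }
};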

5.2 Internal weight and summary tree

We assume that all candidate tasks for splitting have been selected through the external weight metric. It is now necessary to have a meaningful and manageable representation of the parsed tree in order to propagate temporal information within the internal metric. We propose to build a tree which represents a summary of the traversed CD tree. To summarize the execution, we record a limited set of points of special interest occurring during the tree traversal; at the end of the time step, the summary tree is built in a bottom-up fashion starting from these points of interest. In our implementation, we choose contact points as points of interest. Figure 7 illustrates the construction of a summary tree from a previously traversed tree. The internal weight ωi is defined for any internal node of the summary tree and corresponds to the total number of contact points detected in its branch. We can now apply the calibration method (section 3): sub-tasks are generated during a weight-driven traversal of the summary tree.

Figure 7: Overview of the a posteriori method. (a) CD tree traversed during step t − 1: during the traversal, contact points are recorded, and at the end of the time step a summary tree is built in a bottom-up fashion starting from these points. (b) Summary tree built for step t: the summary tree is traversed in a weight-driven way and sub-tasks are generated according to our internal weight metric. (c) Multithreaded tree traversal executed during step t: each sub-task, and the original task, proceeds to a depth-first evaluation of its local CD tree; taboo flags are used to prevent redundant work.

Once the calibration has been applied to the summary tree, the computation resumes with all the sub-tasks and the original task inspecting their local CD trees in depth first. Because the initial task root node dominates the whole tree, a specific operation is needed to avoid redundant computations between the sub-tasks and the initial task: every node used as the local root of a sub-task is flagged taboo. Each time the original task evaluates a node during its CD tree traversal, we first check its presence in the taboo list and skip it in that case. The taboo list must remain small with regard to overall performance; the experimental results in section 8 include a discussion of the number of taboos and of the induced overhead.
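A minimal sketch of this taboo mechanism is given below (the container choice and the names are assumptions): sub-task local roots are recorded in a hash set, and the original task skips them during its own depth-first descent.

#include <cstdint>
#include <unordered_set>

// Key identifying a CD-tree node, i.e. a pair (node of A, node of B).
using NodePairKey = std::uint64_t;

inline NodePairKey makeKey(std::uint32_t nodeA, std::uint32_t nodeB) {
    return (static_cast<std::uint64_t>(nodeA) << 32) | nodeB;
}

struct TabooList {
    std::unordered_set<NodePairKey> flagged;   // local roots claimed by sub-tasks

    void flag(std::uint32_t a, std::uint32_t b) { flagged.insert(makeKey(a, b)); }

    // Called by the original task before evaluating a node: if the node is the
    // local root of a sub-task, the whole branch below it is skipped.
    bool isTaboo(std::uint32_t a, std::uint32_t b) const {
        return flagged.count(makeKey(a, b)) != 0;
    }
};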

5.3 Discussion

The inherent weakness of extracting temporal information from contact points appears when objects are nearly, but not actually, in contact. In such cases CD trees are deeply explored (down to the leaves) without producing any contact point. Consequently, even if the task represents a large amount of computation, the lack of contact points prevents any information feedback, and parallelism extraction is not possible. This questions the idea of using contact points as the starting points from which the summary of the CD tree descent is built. Some authors store temporal CD information using the Front method [Li and Chen 1998]. However, our experiments show that while our summary tree is currently two orders of magnitude smaller than the CD tree, the front can represent as much as 2/3 of the tree size. Such a large structure with read and write accesses (hence data dependencies) is unmanageable in a parallel environment. The second limitation comes from the summary tree building and traversal, which is pure overhead for the purpose of extracting parallelism; however, both tree traversal stages could be more closely linked, or even fully merged. To overcome these limitations, we introduce an alternative method named a priori, because it relies on the extrapolation of future weights rather than on the analysis of previous weights.

6 A priori weight estimation

The a priori method exploits results from the asymptotic analysis of the expected execution time of a BVH traversal. [Klein and Zachmann 2005] introduced the first rigorous approach to analyze the performance of hierarchical collision detection. The number of node-node tests is obviously O(n²) in the worst case, where n is the number of leaves of the tree; however, many researchers have noticed that this number appears to be linear, or even logarithmic, in most practical cases. Using this kind of analysis, we can estimate, for each node Di of a CD tree corresponding to bounding spheres (Ak, Bl), the computational cost remaining until leaf nodes are reached. During the time step, the CD tree is traversed in accordance with our alternate tree traversal strategy (section 4.2).

6.1 Weight estimation

We define the weight function ωi of Di as:

ωi = ∆ · (Pk/rk + Pl/rl)

where Pk and Pl are the numbers of primitives included in the nodes Ak and Bl respectively (in our implementation of the BVH tree, the number of primitives is exactly the number of leaves of the sub-tree rooted at an internal node), rk and rl are the radii of the bounding spheres, and ∆ is the distance between the node centers. Note that this weight function is only valid when the two nodes overlap, and that it can be used both as external and as internal weight. To determine the threshold Ω (section 3.2), we aggregate the weights ωi computed at the root nodes of each task:

Ω = (Σi ωi^root) / (β · nc)

6.2 Discussion

This approach is attractive because we can easily guide the generation of sub-tasks of equivalent weight before executing them. The design of our algorithm was driven by the idea of conceiving a method which allows the parallelism extraction stage to overlap with the remainder of the computation. The quality of the generated, well-balanced tasks strongly depends on the accuracy of the weight computation. So far, we do not use the exact positions of the geometrical primitives included in the BVs Ak and Bl, and we have assumed that those primitives are evenly distributed. This is obviously not always the case in practice, so the weight can be too low or too high with respect to the effective computational cost. In practice, this method succeeds in extracting parallelism from an unbalanced tree. Even if the generated sub-tasks do not all display exactly the same execution time, although they have been split with equivalent weight, this method manages to split all the heaviest tasks. This is one of the desired behaviors, because heavy tasks are the bottleneck for the overall performance. As a remedy, we could use an approach similar to [Klein and Zachmann 2003], which consists in partitioning each BV into cells, allowing uneven distributions of geometrical primitives to be captured at an extra computational cost. We also observe that our approach gives good results for homogeneous scenes. However, when the scene is unbalanced, i.e. some CD pairs concentrate most of the computation while other CD pairs are quickly evaluated, this method tends to extract the same number of sub-tasks from each CD tree. This leads to a high overhead for small objects which do not need an aggressive parallelization scheme. To limit this unbalanced behavior, we propose a third weighting method where a limited amount of temporal information is injected into the a priori method.

7 Hybrid weight estimation

Considering the heterogeneity in terms of computation time of the tasks within a geometrical scene, the parallelization aggressiveness should clearly be adapted to the size of each CD tree. Since this information comes from the hierarchy traversal, it can be used during the next time step. We named our last method hybrid because of this transfer of information from one time step to the next, which is a form of temporal coherency. The structure of the algorithm is kept almost unchanged from the a priori method, but the weight metric now depends on the number of nodes explored during the previous time step. Similarly to the external weight metric of the a posteriori method (section 5), we compute, at the beginning of a time step t and for a task i, the sum Ni^root of all the node-node tests performed by all its sub-tasks at time step t − 1. We can now define the hybrid weight metric as:

ωi(t) = Ni^root(t − 1) · ω̃i(t) / ω̃i^root(t)

where ω̃ is the weight formulation used for the a priori method (section 6). The threshold Ω is obtained by summing the number of node-node tests over all the tasks of the whole scene:

Ω = (Σi Ni^root(t − 1)) / (β · nc)

Considering this minor change, we obtained impressive gains, both in speed-up when the scene is composed of small objects and in robustness: the method is now guaranteed against parallelism explosion (the over-generation of light sub-tasks).
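To summarize the two metrics of sections 6 and 7 in one place, here is a minimal sketch of the a priori weight and of its hybrid rescaling by the node-node test count of the previous time step; the structure names and fields are assumptions for illustration.

#include <cmath>

// Hypothetical per-node data: primitive count and bounding-sphere geometry.
struct BVNode {
    int   primitives;       // P_k: number of leaves below this node
    float radius;           // r_k
    float cx, cy, cz;       // sphere center
};

// A priori weight of a CD-tree node D_i = (A_k, B_l); valid when the spheres overlap.
// omega_i = Delta * (P_k / r_k + P_l / r_l)
double aPrioriWeight(const BVNode& ak, const BVNode& bl) {
    const double dx = ak.cx - bl.cx, dy = ak.cy - bl.cy, dz = ak.cz - bl.cz;
    const double delta = std::sqrt(dx * dx + dy * dy + dz * dz);
    return delta * (ak.primitives / double(ak.radius) + bl.primitives / double(bl.radius));
}

// Hybrid weight: the a priori weight of the node, rescaled by the number of
// node-node tests the whole task actually performed at the previous time step.
// omega_i(t) = N_root(t-1) * omegaTilde_i(t) / omegaTilde_root(t)
double hybridWeight(double aPrioriNode, double aPrioriRoot, long nodeNodeTestsPrevStep) {
    return nodeNodeTestsPrevStep * aPrioriNode / aPrioriRoot;
}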

8 Experimental results

8.1 Benchmarks characterization

We have selected four benchmarks as particularly relevant to illustrate the behavior of the different methods presented throughout this paper. Their main characteristics are summarized in table 1, the efficiency of the CD traversal is illustrated in figure 8, and figure 9 displays snapshots of the simulated scenes.

• 2 knobs small: the most difficult case of the whole benchmark set; the scene is composed of only two colliding objects, and the two objects are very small. This benchmark is used to reveal overheads and to check for over-parallelization.
• 2 knobs large: the scene is composed of two colliding objects. Therefore, there is no parallelism to exploit, and a scheme is required to extract some. Because the objects are much larger, this benchmark is less sensitive to parallelization overhead than 2 knobs small.
• Funnel: this case represents the fall of 16 identical objects into a large funnel. Due to the number of identical objects involved and to their spatial symmetry, the computation is highly parallel and balanced. Therefore, this benchmark has enough parallelism to enjoy a good speed-up even with a simple parallelization scheme. It is useful to measure the overhead introduced by our parallelism extraction methods when parallelism is already sufficiently available.
• Front car: an industrial test case depicting the insertion of a windshield wiper inside the front of a car. It is an unbalanced scene: more than 250 different CD tasks are involved in the computation, but two tasks alone concentrate 60% and 30% of the whole execution time. Therefore, for this case, executing the tasks in parallel without splitting them is bound to a speed-up of at most 1.7.

8.2 Experimental platform

We conducted this research based on the quasi-LMD implementation detailed in [Merlhiot 2007]. From an implementation point of view, we found OpenMP, and specifically the Intel TaskQ extension [Su et al. 2002], very convenient: it allows parallelism to be generated directly from STL high-level data structures. All benchmarks were run on two Intel systems:

Benchmark name     Number of objects   Geometrical primitives   Nodes traversed   Contact points
2 knobs - large            2                  387460                  76907              403
2 knobs - small            2                    6058                  13495               26
Funnel                    17                   48487                 255537               65
Front car                 23                  802712                 496912              252

Table 1: Benchmarks main characteristics. The number of nodes explored as well as the number of contact points correspond to the average values observed during the first 150 time steps. The initial conditions are kept unchanged in all our experiments.

Figure 8: Acceleration structure and dynamic pruning. (a) Collision between two small knobs. (b) Same tree with the two-way traversal. Radial representation of the CD tree explored for the 2 small knobs case. Clearly the acceleration structure plays its role: out of a total of over 9,000,000 possible leaves (collisions), only 13,000 nodes are explored. During the weight-driven stage (in blue), 344 nodes are explored.

• IA32: a dual Core2 bi-core system (4 cores in total) running at 2 GHz, with 2 GB of memory.
• IA64: a quad Montecito bi-core system (8 cores in total) running at 1.7 GHz, with 16 GB of memory.

All code was compiled with Intel ICC 10.0.6 using MKL 9.1; the best compilation flags observed were -O3 -fno-alias.

8.3 Results discussion

Full results are detailed in table 2, and figure 10 shows the speed-ups obtained on our 8-core platform.

A posteriori weighting. Taboos have to be handled with care: on the one hand, their number is directly linked to the amount of extracted parallelism; on the other hand, checking the presence of a node in the taboo list during the CD tree traversal is pure overhead for the original task. To measure this overhead, we forced all original tasks to check a taboo list during a sequential execution. The overhead is less than 1% of the execution time for up to 16 taboos; it is over 40% for a taboo list of length 128, and 512 taboos yield a 3x slow-down. Therefore, if a task has generated 512 sub-tasks, its original CD tree has been considerably pruned by the sub-tasks, but the remaining exploration will take 3 times the initial time. During our experiments we never observed more than xxx taboos for any of our benchmarks.

Figure 9: Benchmarks used. (a) 2 knobs. (b) Funnel. (c) Front car. Front car is the heaviest and most realistic test case, coming directly from industry. Funnel is used to test our method when confronted with an embarrassingly parallel scene. At the opposite, 2 knobs (both large and small) illustrates a case where no parallelism is available and it has to be extracted. Arrows (green, yellow, red) represent contact points.

A priori weighting. The a priori weighting method is successful in extracting parallelism from otherwise sequential benchmarks. From the single task of 2 knobs (small), it extracts up to 15 tasks, and 31 from 2 knobs (large). However, as detailed by figure 11, the weight metric appears to be weakly correlated with the real execution time of a task. Therefore, despite the extraction of sub-tasks of equivalent weight, the number of nodes eventually explored by each sub-task differs from the extrapolated number of tests. This can lead to useless synchronization overhead, by generating sub-tasks of negligible computational weight.

Figure 11: Weak correlation between the number of nodes estimated to be traversed (nbNNextrapolated) and the number of nodes eventually explored (nbNNeffective) for the 2 large knobs benchmark.

Hybrid weighting. The hybrid method clearly outperforms the two other weighting approaches and exhibits the best speed-up (see figure 10). This improvement has two causes:
1. Hybrid weighting has far less overhead than the a posteriori method. Within the a posteriori method, the parallelism extraction is done in addition to the main computation; at the opposite, an advantage shared by the hybrid and a priori schemes is the overlapping of this stage with the computation itself.
2. Hybrid weighting addresses a fundamental limitation of the a priori weight estimation. Because of trivial geometrical rules, the weight estimations at the top of the tree are the least accurate, and basing the parallelization threshold on the least accurate geometrical approximations is clearly a bad idea. This is why the injection of even a limited amount of temporal information allows this inaccuracy to be corrected.

About the resources used for parallelism extraction in the hybrid method: we measured that, for all our benchmarks, the length of the priority queue used for task calibration never exceeds 220 elements (as expected, this maximal value was reached for the front car benchmark). The number of nodes explored during the weight-driven stage is also very modest: up to 753 for the front car (0.1% of the CD tree) and down to 344 for the 2 small knobs case (2.5% of the CD tree).

Parallelism exploitation vs parallelism extraction. The funnel benchmark is embarrassingly parallel (16 independent CD tasks of similar weight); its main purpose is to check for over-parallelization. All our methods avoid this pitfall, and the speed-up always scales almost linearly. Nevertheless, our task calibration comes with some overhead, such as the extraction of some unneeded sub-tasks: a simple parallelism exploitation scheme (an OpenMP loop around the task loop as in figure 2) delivers better performance, 12.8 ms running on 8 cores.

Figure 10: Speed-up measurements on the 8-core Montecito system. (a) 2 knobs large. (b) 2 knobs small. (c) Funnel. (d) Front car. Parallelism exploitation is never able to deliver significant performance except in the particular case of Funnel. The hybrid method remains the clear winner, by both its robustness and the overall speed-up extracted.

9 Conclusion

Load balancing for CD is a thorny problem due to the heterogeneity of individual CD tasks. One of our contributions is a task calibration scheme which generates a set of tasks that are homogeneous in terms of computational time. Starting from any number of tasks, our calibration scheme generates a number of tasks matching the number of available cores; consequently, we limit synchronization overhead. We also present an extensive implementation of the calibration scheme with three weight metrics. The first observation is that our purely temporal weight metric, based on summary trees, suffers from a lack of robustness: in the absence of contact points, this approach fails to extract enough parallelism. Indeed, contact points may not be the relevant information to exploit between two consecutive time steps.

The second scheme is based on a geometrical weight metric; it succeeds in extracting parallelism but remains too sensitive to geometrical conditions and can lead to substantial over-parallelization. We found that re-introducing temporal information into this second approach drastically limits the over-parallelization and its induced overhead. On our industrial case, whose speed-up was originally limited to 1.7, parallelism extraction allows us to deliver a speed-up of 6.9 on 8 cores. Room for improvement exists concerning the weight metric: it is possible to optimize the correlation between the weight metric and the underlying geometry, and more temporal information could be injected within the CD tree. While the CD community has poured a considerable amount of research into GPUs, GP-GPU and other futuristic architectures, current parallelization problems surprisingly remain unanswered and have been underestimated. Nowadays multicores are omnipresent, yet no CD algorithm addresses parallelization on such platforms in a consistent manner. We hope that our research will modestly remind the CD community that multicore processors are still not well understood and remain an interesting field of research which should be more widely investigated.

Acknowledgements

The access to the 16-core Montecito system was made possible thanks to the PARMA project and teams from BULL.
• http://www.parma-itea2.org
• http://www.bull.com

References

Agarwal, P. K., de Berg, M., Gudmundsson, J., Hammar, M., and Haverkort, H. J. 2001. Box-trees and R-trees with near-optimal query time. In SCG '01: Proceedings of the seventeenth annual symposium on Computational geometry, ACM, New York, NY, USA, 124–133.
Ehmann, S. A., and Lin, M. C. 2001. Accurate and fast proximity queries between polyhedra using convex surface decomposition. In EG 2001 Proceedings, A. Chalmers and T.-M. Rhyne, Eds., vol. 20(3), Blackwell Publishing, 500–510.
Ericson, C. 2004. Real-Time Collision Detection (The Morgan Kaufmann Series in Interactive 3D Technology). Morgan Kaufmann, December.
Figueiredo, M., and Fernando, T. 2004. An efficient parallel collision algorithm for virtual prototype environment. In ICPADS '04.
Gottschalk, S., Lin, M. C., and Manocha, D. 1996. OBBTree: A hierarchical structure for rapid interference detection. Computer Graphics 30, Annual Conference Series, 171–180.
Grinberg, I., and Wiseman, Y. 2007. Scalable parallel collision detection simulation. In Signal and Image Processing.
Hubbard, P. M. 1996. Approximating polyhedra with spheres for time-critical collision detection. ACM Trans. Graph. 15, 3, 179–210.
Hurley, J. 2005. Ray tracing goes mainstream. Intel Technology Journal, vol. 9.
Johnson, D. E., and Cohen, E. 1998. A framework for efficient minimum distance computations. In IEEE Int. Conf. on Robotics and Automation, 3678–3684.
Johnson, D. E., Willemsen, P., and Cohen, E. 2005. Six degree-of-freedom haptic rendering using spatialized normal cone search. IEEE Trans. on Visualization and Computer Graphics, vol. 11, 661–670.
Kitamura, Y., Smith, A., Takemura, H., and Koshino, F. 1998. A real-time algorithm for accurate collision detection for deformable polyhedral objects.
Klein, J., and Zachmann, G. 2003. Time-critical collision detection using an average-case approach. In VRST '03: Proceedings of the ACM symposium on Virtual reality software and technology, ACM, New York, NY, USA, 22–31.
Klein, J., and Zachmann, G. 2005. The expected running time of hierarchical collision detection. In SIGGRAPH '05: ACM SIGGRAPH 2005 Posters, ACM, New York, NY, USA, 117.
Klein, J. 2005. Efficient Collision Detection for Point and Polygon Based Models. PhD thesis, University of Paderborn.
Klosowski, J. T., Held, M., Mitchell, J. S. B., Sowizral, H., and Zikan, K. 1998. Efficient collision detection using bounding volume hierarchies of k-DOPs. IEEE Transactions on Visualization and Computer Graphics 4, 1, 21–36.
Larsson, T., and Akenine-Möller, T. 2001. Collision detection for continuously deforming bodies. In Eurographics, 325–333.
Li, T.-Y., and Chen, J.-S. 1998. Incremental 3D collision detection with hierarchical data structures. In VRST '98: Proceedings of the ACM symposium on Virtual reality software and technology, ACM, New York, NY, USA, 139–144.
Lin, M. C., and Gottschalk, S. 1998. Collision detection between geometric models: a survey. In IMA Conference on Mathematics of Surfaces, 37–56.
McNeely, W. A., Puterbaugh, K. D., and Troy, J. J. 1999. Six degree-of-freedom haptic rendering using voxel sampling. In SIGGRAPH '99: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 401–408.
Merlhiot, X. 2007. A robust, efficient and time-stepping compatible collision method for non-smooth contact between rigid bodies of arbitrary shape. In Multibody Dynamics.
Myszkowski, K., Okunev, O. G., and Kunii, T. L. 1995. Fast collision detection between complex solids using rasterizing graphics hardware. The Visual Computer 11, 9, 497–512.
Sedgewick, R. 1997. Algorithms in C: Parts 1–4, Fundamentals, Data Structures, Sorting, and Searching. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Shevtsov, M., Soupikov, A., and Kapustin, A. 2007. Highly parallel fast kd-tree construction for interactive ray tracing of dynamic scenes. In Eurographics, vol. 6.
Su, E., Tian, X., Girkar, M., Haab, G., Shah, S., and Petersen, P. 2002. Compiler support of the workqueuing execution model for Intel SMP architectures. In EWOMP: European Workshop on OpenMP.
Teschner, M., Kimmerle, S., Heidelberger, B., Zachmann, G., Raghupathi, L., Fuhrmann, A., Cani, M.-P., Faure, F., Magnenat-Thalmann, N., Straßer, W., and Volino, P. 2005. Collision detection for deformable objects. Computer Graphics Forum 24, 1 (Mar.), 61–82.
van den Bergen, G. 1997. Efficient collision detection of complex deformable models using AABB trees. J. Graph. Tools 2, 4, 1–13.
van Emde Boas, P., Kass, R., and Zijlstra, E. 1977. Design and implementation of an efficient priority queue. Mathematical Systems Theory, 99–127.
Weghorst, H., Hooper, G., and Greenberg, D. P. 1984. Improved computational methods for ray tracing. ACM Trans. Graph. 3, 1, 52–69.
Wong, W. S.-K. 2004. Image-based collision detection for deformable cloth models. IEEE Transactions on Visualization and Computer Graphics 10, 6, 649–663.
Yoon, S.-E., Lindstrom, P., Pascucci, V., and Manocha, D. 2005. Cache-oblivious mesh layouts. In SIGGRAPH '05: ACM SIGGRAPH 2005 Papers, ACM, New York, NY, USA, 886–893.
Zachmann, G. 1998. Rapid collision detection by dynamically aligned DOP-trees. In VRAIS '98: Proceedings of the Virtual Reality Annual International Symposium, IEEE Computer Society, Washington, DC, USA, 90.

Benchmark        Method            Platform   Seq. Time (ms)   1 core   2 cores   4 cores   8 cores
2 knobs large    a posteriori      IA32       25.5             25.6     17.1      13.1      na
2 knobs large    a priori          IA32       25.5             26.2     18.0      14.8      na
2 knobs large    hybrid            IA32       25.5             26.1     13.8       7.7      na
2 knobs large    // exploitation   IA32       25.5             26.1     25.6      25.9      na
2 knobs large    a posteriori      IA64       23.8             25.6     19        18.5      10.6
2 knobs large    a priori          IA64       23.8             24.0     13.1      12.9      12.7
2 knobs large    hybrid            IA64       23.8             24.2     15.3      10.1       6.0
2 knobs large    // exploitation   IA64       23.8             23.8     23.8      23.8      23.8
2 knobs small    a posteriori      IA32        2.8              2.6      2.1       1.7      na
2 knobs small    a priori          IA32        2.8              2.7      1.9       1.4      na
2 knobs small    hybrid            IA32        2.8              2.8      1.5       0.9      na
2 knobs small    // exploitation   IA32        2.8              2.8      2.8       2.8      na
2 knobs small    a posteriori      IA64        2.7              3.0      1.6       1.4       1.7
2 knobs small    a priori          IA64        2.7              2.8      2.8       2.1       1.1
2 knobs small    hybrid            IA64        2.7              2.9      1.8       1.2       0.7
2 knobs small    // exploitation   IA64        2.7              3.1      2.7       2.7       3.1
Funnel           a posteriori      IA32       87.1             86.6     47.2      34.8      na
Funnel           a priori          IA32       87.1             87.8     45.3      24.7      na
Funnel           hybrid            IA32       87.1             87.5     49.1      27.4      na
Funnel           // exploitation   IA32       87.1             87.6     45.58     32.2      na
Funnel           a posteriori      IA64      114.2            114.4     58.6      25.4      17
Funnel           a priori          IA64      114.2            107.5     59.5      29.1      19.9
Funnel           hybrid            IA64      114.2            107.2     61.2      28.8      18.2
Funnel           // exploitation   IA64      114.2            114.8     57.1      23.4      12.8
Front car        a posteriori      IA32      131.2            129.4     69.0      51.1      na
Front car        a priori          IA32      131.2            132.1     82.3      63.4      na
Front car        hybrid            IA32      131.2            130.6     70.1      35.1      na
Front car        // exploitation   IA32      131.2            129.8     74.2      72.7      na
Front car        a posteriori      IA64      134              138.8     73.3      48.8      34.9
Front car        a priori          IA64      134              133.1     74.1      51.1      38.1
Front car        hybrid            IA64      134              129.4     68.0      35.8      19.4
Front car        // exploitation   IA64      134              130.4     72.3      71        72.8

Table 2: Experimental results. Seq. Time corresponds to a sequential execution without any parallelism extraction scheme. 1 core corresponds to an execution with a single thread and all the parallelism extraction features enabled; it is therefore convenient to estimate the global overhead of the tested methods. 2 cores, 4 cores and 8 cores report the observed execution times. // exploitation stands for parallelism exploitation, i.e. a parallel execution of all the initial tasks present in the scene (a simple OpenMP loop over the initTaskList).