Order number: 121

THESIS / UNIVERSITÉ DE BRETAGNE SUD, under the seal of the Université Européenne de Bretagne, to obtain the degree of: DOCTOR OF THE UNIVERSITÉ DE BRETAGNE SUD

presented by

Jérémie Guillot
Host laboratory: Lab-STICC, Laboratoire des Sciences et Techniques de l'Information, de la Communication et de la Connaissance / Lorient

Specialization: Electronics and Industrial Computer Science, Doctoral School SICMA

Optimization Techniques for High Level Synthesis and pre-Compilation based on Taylor Expansion Diagrams

Thesis defended on 15 October 2008 before a jury composed of:
Yves Mathieu / President / Telecom Paris
Emmanuel Boutillon / Thesis advisor / Université de Bretagne Sud
Maciej Ciesielski / Co-supervisor / University of Massachusetts
Bruno Rouzeyre / Reviewer / Université de Montpellier 2
Olivier Sentieys / Reviewer / Université de Rennes 1


"Copy from one, it's plagiarism; copy from two, it's research." (Wilson Mizner, 1876-1933)

Acknowledgements

I thank Mr Yves Mathieu, Professor at Telecom Paris, who does me the honour of chairing this jury. I thank Bruno Rouzeyre, Professor at the Université de Montpellier 2, and Olivier Sentieys, Professor at ENSSAT, for having kindly accepted the role of reviewers. I thank Maciej Ciesielski, Professor at the University of Massachusetts, UMass (Amherst), and Yves Mathieu, Maître de conférences at ENST (Paris), for having agreed to judge this work. I also thank Emmanuel Boutillon, Professor at the Université de Bretagne Sud, who supervised my thesis. Despite an often very full schedule, he was an inexhaustible source of inspiration and ideas throughout these doctoral years.

I also wish to express all my gratitude to my family and friends who, in one way or another, supported (or put up with) me during my studies. In particular, I thank my parents for always leaving me free to choose my professional path, and my brother and sister for the many moments shared during childhood. I also take the opportunity of this dissertation to thank my close friends: Arnaud & Adeline, Cyril & Béa, Cyril & Gwendo, Cyriaque & Sandy, who were a source of support and good humour during these three years (gulp, almost four...). A word as well to the doctoral students and colleagues I worked alongside during this thesis at LESTER, for having tried, with more or less success, to liven up the life of the laboratory. In particular I would like to mention Paul and Cyrille for the organization of Majecstic and the lunchtime volleyball; Sébastien and Antoine for their resistance to Irish malts and hops; and Marc, Pierre, Thierry, Florence and Virginie for the great class of the 10 a.m. puns. I also thank Daniel and Patricia Gomez-Prado who, on each visit to Massachusetts, gave me the warmest of welcomes. Besides being wonderful working partners, they have become friends. Finally, I would like to thank my partner Hélène, for supporting me day after day and for bringing into the world our child Enora, born on my thirtieth birthday.


Contents

Introduction

1 State of The Art
  1.1 Compiler-level optimizations
    1.1.1 Current methods used in compilers
    1.1.2 Compiler optimization with the GNU Compiler Collection (GCC) - Case study
  1.2 Domain/Target Specific Optimizations
  1.3 Conclusion

2 Taylor Expansion Diagram
  2.1 Introduction
  2.2 TED Research Timeline
  2.3 Construction Principles
    2.3.1 Mathematical background
    2.3.2 Formalism: TED as a particular class of graph
    2.3.3 Formalism: Building a polynomial expression from a TED
    2.3.4 Formalism: TED construction from a polynomial expression
    2.3.5 Example of TED construction
    2.3.6 Reduction and Normalization of a TED
      2.3.6.1 Normalization of Taylor Expansion Diagrams
      2.3.6.2 Reduction
    2.3.7 TED composition operation
  2.4 TED Properties
    2.4.1 Canonicity
    2.4.2 Complexity of TED in terms of the number of nodes
    2.4.3 Impact of variable ordering
    2.4.4 The algorithm for local swapping
    2.4.5 Reordering algorithms for TEDs
    2.4.6 Complexity of TED in terms of the number of operations
      2.4.6.1 Maximum theoretical number of edges
      2.4.6.2 Complexity of generated data flows
    2.4.7 Normal Factored Form
  2.5 DFG Generation: Transformation of Function into Structure
    2.5.1 Cut-based TED decomposition
      2.5.1.1 Additive and Multiplicative Cuts
      2.5.1.2 Admissible Cut Sequences
      2.5.1.3 Limitations of Cut-based Decomposition
    2.5.2 Complex TED Decomposition
    2.5.3 Counting the Number of Operations in TED
  2.6 Other TED-based Representation
    2.6.1 TED Linearization

3 TED for high level synthesis
  3.1 Using TED for High Level Synthesis or Software Compilation
  3.2 Extensions to the original TED representation
    3.2.1 Representing coefficients of DSP transforms as variables
    3.2.2 Supporting arithmetic expressions in C
  3.3 Conclusion

4 Common Subexpression Elimination
  4.1 Common Subexpression Elimination (CSE)
    4.1.1 Static Common Subexpression Elimination
      4.1.1.1 Definitions
      4.1.1.2 Basic Static Common Subexpression Elimination (SCSE) algorithm
      4.1.1.3 SCSE Results and comments
      4.1.1.4 Enhanced Static Common Subexpression Elimination
      4.1.1.5 Conclusions
    4.1.2 Dynamic Common Subexpression Elimination (DCSE)
      4.1.2.1 Solving the limitations of SCSE
      4.1.2.2 Preliminary version of DCSE
      4.1.2.3 Limitations
      4.1.2.4 Final version of DCSE
      4.1.2.5 Discussions about the DCSE algorithm
  4.2 Eliminating redundant multiplications
  4.3 Reordering purely additive terms

5 Factorization using Pattern Recognition
  5.1 Decomposition of constant coefficients
  5.2 Algorithm to search for factorizable patterns
  5.3 Application to DFT8
  5.4 Application to DFT16
  5.5 Conclusion

Conclusions and Perspectives

A General Analysis of Linear DSP transforms
  A.1 Discrete Cosine Transform (DCT)
  A.2 Discrete Sine Transform (DST)
  A.3 Discrete Fourier Transform
  A.4 Walsh Hadamard Transform

Personal Contributions

Glossary

Bibliography

List of figures

Introduction

In 1965, Gordon E. Moore made the following observation regarding the complexity of Integrated Circuits (ICs): "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year" [Moo65]. This doubling of transistor count every two years, known as Moore's Law, is a self-fulfilling prophecy that has been driving the semiconductor industry for the last forty years. Different interpretations of this law have been made in the past to define what was meant by the term complexity, and there is probably no single answer. In particular, an article by David E. Liddle (U.S. Venture Partners) entitled The Wider Impact of Moore's Law caught my attention. According to the author, Gordon Moore was describing "an economic rather than technological phenomenon". In a sense, respecting Moore's law has become a necessary condition for the big semiconductor companies to stay alive. To date, most of the major companies have succeeded in meeting this challenge thanks to large improvements in the fabrication process (lithographic methods, voltage scaling, etc.). This rule has enabled the industry to provide ever more complex circuits.

However, the advancement of semiconductor technology has largely outpaced the progress of Electronic Design Automation (EDA) tools and design methodologies. The result is an ever deeper productivity gap between what the technology makes available and what can be realised in practice. In 1999, the Semiconductor Industry Association described in its roadmap the evolution of design complexity versus designer productivity, illustrated by the chart shown in Figure 1.

Figure 1 – Design Productivity Gap

With growing time-to-market pressures, the demand for efficiency becomes more and more important. The design community agrees that the key elements needed to achieve a quantum leap in productivity are:

- Design Reuse (Intellectual Property, "IP");
- shifting the design process to a higher level of abstraction (such as High Level Synthesis, "HLS").

The second item addresses the need for developing HLS and verification tools, which is the topic of this dissertation. Synthesis is a process that translates a behavioral description (control, loops, complex operations, etc.) into a structural representation in the form of an architecture (composed of registers and basic operations). This is similar to the compilation process, which translates C/C++ code into machine code directly executable on a processor in the form of assembly instructions. The revolution that happened in synthesis is similar to the one that happened in the past in software compilation. Raising the level of abstraction allows one to describe the functionality with a language close to the human one, thus helping designers reduce the design time. Nevertheless, the main weak point of this approach is that, most of the time, the code generated by compilers or the architecture generated by synthesis tools is not of sufficient quality. As a result, a trade-off between design time and efficiency of the solution has to be found. In practice, high level synthesis techniques do not produce circuits whose performance can compete with manual results, and hand tuning is still necessary to obtain competitive designs. Such hand-tuning is extremely time consuming and comes without any guarantee of success, due to the high complexity of actual designs. This is particularly the case in computation intensive applications, such as telecommunications, multimedia and digital signal processing, where the quality of the synthesis or compilation results is strongly related to the quality of the initial specification.

According to [GGDN04], poor synthesis results are due to a lack of optimization at the language level. In fact, most of the transformations are performed at the operation level, which seriously limits the scope of the synthesis or compilation results. In contrast, performing the optimization at the source level provides profound modifications of the initial description and allows higher performance results to be obtained.

As a rule of thumb used in computer programming, the Pareto principle states that in most programs 90% of the runtime is spent in 10% of the code [JHK+05], [con08a]. Even if the exact ratio can vary, the reality is that a small portion of the program is often responsible for most of the program runtime. This is particularly the case when the code contains many loops.

This thesis focuses on computation intensive applications. Examples of such designs include Digital Signal Processing (DSP), telecommunications, embedded and multimedia applications. Computationally intensive tasks in these applications are performed by discrete signal transforms, such as the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), Walsh-Hadamard Transform (WHT), etc., represented as polynomial expressions. Thus, to reduce the program runtime or the hardware resources needed to execute the application, design automation tools and compilers have to focus their efforts on reducing the complexity of these application kernels. This is exactly the goal of the approach proposed in this work.

In this thesis, an original contribution to the pre-synthesis and pre-compilation phases of the design process is proposed. This work aims to establish a new design flow to optimize the initial description prior to synthesis or compilation. New methods, based on a canonical graph representation called the Taylor Expansion Diagram, are presented to support a high level of abstraction without losing efficiency.

This thesis is organized as follows. The first chapter presents traditional techniques used in compilation to reduce and optimize the number of computations. Then, some domain-specific tools are described and their advantages and weak points are discussed. The second chapter reviews the concept of Taylor Expansion Diagrams (TED) and explains its construction principles and properties. Chapter 3 presents a contribution to the TED formalism that allows constant coefficients to be encapsulated into symbolic variables, thus making the TED independent of the precision needed for the computation. In this chapter, we also show how this improvement in the TED formalism allows mathematical operations in the complex domain to be supported. Chapter 4 investigates new TED-based methods to optimize the computational complexity. In particular, it describes a novel method to perform common subexpression elimination in a purely symbolic fashion. Chapter 5 presents a novel and original approach aimed at factorizing polynomials. This method attempts to mimic the radix-2 principle by recognizing factorizable cycles in a TED graph. Experimental results, obtained by applying TED-based optimization to a number of practical examples, are given throughout this thesis. Finally, the Conclusion chapter addresses the impact of the methods and discusses future research directions.


Chapter 1

State of The Art

This chapter describes the state of the art in optimization methods used by software compilers and by the compilation front ends of High Level Synthesis (HLS) tools. The chapter is organized in two main sections. The first one deals with compiler-level optimizations, such as code motion, dead code elimination, loop unrolling, etc., while the second one describes domain-specific techniques and tools. The goal of such optimizations is to decrease the software runtime or minimize the amount of resources needed during computation, and to help the designer/programmer produce code or a design that satisfies the required constraints.

1.1 Compiler-level optimizations

This section briefly reviews the optimization techniques used by software compilers to improve the efficiency of the generated software. The current techniques are first described; then the capabilities of the GNU Compiler Collection (GCC) are examined with a case study example.

1.1.1 Current methods used in compilers

Several compiler-level optimizations are employed in modern compilers. Optimization techniques, such as dead code elimination, common subexpression elimination, loop unrolling, etc., are briefly described in this section. More details about compiler optimizations can be found in [ASU86].

1. Dead Code Elimination
This technique analyzes the computation flow in order to remove the parts of the code which do not affect the outputs of the program. In particular, it removes unreachable code (i.e. code that is never executed) and also all parts of the code that rely on dead variables. The code below shows a simple example with dead and unreachable code. In this example, the compiler notices that, due to the unfeasible condition a==0, the if statement is unnecessary. It also deletes the instruction written after return, since this instruction is unreachable; then it deletes the declaration int b=12, since this variable is no longer used in the code.

    int test_DCE() {
        int a = 5;
        int b = 12;    /* Dead variable */
        int c;
        if (a == 0)    /* Dead code */
        { b = 4; }     /* Dead code */
        c = a + 4;
        return c;
        b = 11;        /* Unreachable code */
    }

2. Common Subexpression Elimination (CSE)
This optimization technique searches for redundant expressions in the code and replaces them with a unique variable holding the computed value, as shown in the following example:

    X1 = a*b + c;
    X2 = a*b * d;    /* The computation of a*b is redundant */

This can be rewritten as:

    tmp = a*b;       /* An intermediate variable is created */
    X1 = tmp + c;
    X2 = tmp * d;

Assuming that the execution of the store operation needed to handle the tmp variable is negligible compared to the MPY operation, it is worthwhile to replace all the instances of the expression a*b by the intermediate variable tmp. Nevertheless, the optimization kernel needs to take care of the number of temporary variables created to hold such values; an excessive number of temporary values may result in spilling registers to external memory, which may take longer to execute than recomputing an arithmetic result when it is needed.

3. Loop Unrolling
Loop unrolling is a technique that can partially or totally unroll a loop, subject to the pipeline capabilities of the target processor, taking into account the overhead implied by the jump instructions and conditional branches of the loop iteration. It consists in duplicating the loop body several times, up to the number of iterations for full unrolling. This decreases the cost of the loop control and increases instruction-level parallelism, which can potentially be exploited by processors.


    #define iter_max 50

    for (int i = 0; i < iter_max; i++) {
        Sum[i] = Tab[i] + k;
    }

If the target architecture is able to compute five additions in parallel, the code can be rewritten as follows so that, instead of fifty iterations (i.e. jump and conditional branch instructions), the algorithm needs only ten iterations.

    #define iter_max 50

    for (int i = 0; i < iter_max; i = i + 5) {
        Sum[i]   = Tab[i]   + k;
        Sum[i+1] = Tab[i+1] + k;
        Sum[i+2] = Tab[i+2] + k;
        Sum[i+3] = Tab[i+3] + k;
        Sum[i+4] = Tab[i+4] + k;
    }

Moreover, the CSE optimization can then be executed to reduce the address computation. Indeed, the terms i+1, i+2, etc., can be computed only once and stored as intermediate variables. The algorithm can then be rewritten as follows:

    #define iter_max 50

    unsigned int tmpadd;
    for (int i = 0; i < iter_max; i = i + 5) {
        Sum[i] = Tab[i] + k;
        tmpadd = i + 1;
        Sum[tmpadd] = Tab[tmpadd] + k;
        tmpadd = i + 2;
        Sum[tmpadd] = Tab[tmpadd] + k;
        tmpadd = i + 3;
        Sum[tmpadd] = Tab[tmpadd] + k;
        tmpadd = i + 4;
        Sum[tmpadd] = Tab[tmpadd] + k;
    }

Such an optimization can create new opportunities for other simplifications (such as additional CSE).

4. Loop Invariant Code Motion
This optimization consists of moving invariant instructions outside of loops to reduce the number of times they are executed, thus providing a runtime speedup. Considering the following code, two optimization methods can be applied.

    while (j < maximum - 1)
    {
        a = 5;
        j = j + a + (4 + tab[k]) * pi + 5;
    }

First, the assignment of variable a=5 does not need to be repeated at each iteration of the loop, but can be done only once during initialization. Then the computation of maximum-1 and a+(4+tab[k])*pi+5 can be moved outside the loop and precomputed, resulting in the following code:

    int maxval = maximum - 1;
    int a = 5;
    int calcval = a + (4 + tab[k]) * pi + 5;
    while (j < maxval)
    {
        j = j + calcval;
    }

5. Constant Folding
Constant folding is a compiler optimization technique in which arithmetic instructions that always produce the same result are replaced by their result. This optimization can only be performed when the instructions in question can be shown to produce the same result at compile time. Basically, the compiler seeks any operation that has constant operands and no side effects, computes the result and replaces the entire expression.

6. Copy Propagation
Copy propagation replaces all the instances of direct assignments with their respective values, as shown in the next example:

    int compute(int A, int B, int C) {
        int D;
        B = A;
        C = B;
        D = C;
        return D;
    }

After copy propagation the function compute becomes:

    int compute(int A) {
        int B, C, D;
        B = A;
        C = A;
        D = A;
        return D;
    }

The above code can then be optimized further by applying dead code elimination:

    int compute(int A) {
        return A;
    }

Copy propagation also aids constant folding, as shown in this sample:

    B = 4;
    C = 1 + B;

which can be replaced by:

    B = 4;
    C = 5;

Most of these optimization techniques are performed by compilers during the analysis of the lexical and syntactic graph, via a Static Single Assignment (SSA) representation or a Directed Acyclic Graph (DAG). However, even if in theory these methods seem to be perfectly integrated in modern tools, the next section shows that in some cases compilers fail to efficiently eliminate redundancy.
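To give a concrete picture of the SSA idea mentioned above (a minimal sketch; the function and variable names x1, x2, x3 are hypothetical and not taken from the thesis), each variable is assigned exactly once, which makes redundancies and dead assignments explicit and easy for the compiler to detect:

    /* Original straight-line code reassigns x three times:      */
    /*   int x = a + b;  x = x * 2;  x = x + c;                   */
    /* The same computation rewritten in SSA style, with one     */
    /* definition per name; every use now has a unique producer. */
    int ssa_example(int a, int b, int c) {
        int x1 = a + b;      /* first definition  */
        int x2 = x1 * 2;     /* second definition */
        int x3 = x2 + c;     /* third definition  */
        return x3;
    }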

1.1.2 Compiler optimization with the GNU Compiler Collection (GCC) - Case study

After having reviewed the most popular methods in compiler optimization, it is interesting to evaluate how efficiently compilers simplify and optimize code. For the purpose of this study, a trivial program has been compiled with the GCC 4.1.2 compiler. The compilation flags have been set to turn on the optimization options. Without any optimization, the compiler's goal is to reduce the compilation time, whereas turning on the optimization flags forces the compiler to improve the runtime performance and/or code size at the expense of compilation time and the ability to debug the program. The optimization flag has been set to -O3, which means that all the optimizations were invoked. To evaluate the efficiency of the code during and after compilation, the following commands were used:

- gcc -O3 -fdump-tree-all-details testX.c, to view the code after optimizations;
- gcc -S -O3 testX.c, to generate the assembly code.

    /* CSE test program for GCC */
    #include <stdio.h>
    #include <stdlib.h>
    int main(int argc, char *argv[]) {
        int X1, X2, Y, a, b, c;
        a = atoi(argv[1]);
        b = atoi(argv[2]);
        c = atoi(argv[3]);
        X1 = a*a + a*b + a*c + b*c;
        X2 = (a + b) * (a + c);
        Y = X1 - X2;
        printf("The value of Y is: %d", Y);   // Y is equal to 0
        return 0;
    }

In this program, we can notice that X1 and X2 are equivalent, resulting in Y = 0. It is therefore unnecessary to compute the output Y. A dump of this code after being passed through all the optimization stages of GCC is shown in the following listing:

     1  ;; Function main (main)
     2  main (argc, argv)
     3  {
     4    int c;
     5    int b;
     6    int a;
     7    int D.1903;
     8
     9  :
    10    a = atoi(*(argv + 4B));
    11    b = atoi(*(argv + 8B));
    12    c = atoi(*(argv + 16B));
    13    D.1903 = a + b;
    14    printf(&"The value of Y is: %d"[0], a * (c + D.1903) + b * c - D.1903 * (a + c));
    15    return 0;
    16  }

As we can notice, a subexpression for (a+b) has been recognized (line 13) by the compiler and reused twice (line 14) in the code. The associated assembly code for Pentium processors with full optimizations is as follows:

     1  .file "test1.c"
     2  .section .rodata.str1.1,"aMS",@progbits,1
     3  .LC0:
     4  .string "The value of Y is: %d"
     5  .text

     6  .p2align 4,,15
     7  .globl main
     8  .type main, @function
     9  main:
    10    leal 4(%esp), %ecx
    11    andl $-16, %esp
    12    pushl -4(%ecx)
    13    pushl %ebp
    14    movl %esp, %ebp
    15    subl $24, %esp
    16    movl %ecx, -16(%ebp)
    17    movl %ebx, -12(%ebp)
    18    movl %edi, -4(%ebp)
    19    movl %esi, -8(%ebp)
    20    movl 4(%ecx), %esi
    21    movl 4(%esi), %eax
    22    movl %eax, (%esp)
    23    call atoi
    24    movl %eax, %edi
    25    movl 8(%esi), %eax
    26    movl %eax, (%esp)
    27    call atoi
    28    movl %eax, %ebx
    29    movl 16(%esi), %eax
    30    movl %eax, (%esp)
    31    call atoi
    32    leal (%edi,%ebx), %ecx
    33    movl $.LC0, (%esp)
    34    leal (%eax,%ecx), %edx
    35    imull %edi, %edx
    36    imull %eax, %edi
    37    imull %eax, %ebx
    38    imull %edi, %ecx
    39    addl %ebx, %edx
    40    subl %ecx, %edx
    41    movl %edx, 4(%esp)
    42    call printf
    43    movl -16(%ebp), %ecx
    44    xorl %eax, %eax
    45    movl -12(%ebp), %ebx
    46    movl -8(%ebp), %esi
    47    movl -4(%ebp), %edi
    48    movl %ebp, %esp
    49    popl %ebp
    50    leal -4(%ecx), %esp
    51    ret
    52  .size main, .-main
    53  .ident "GCC: (GNU) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)"
    54  .section .note.GNU-stack,"",@progbits

Evaluating the quality of the results produced by the compiler is not an easy task. Nevertheless, this example clearly demonstrates that, even in this simple case, the compiler did not succeed in fully optimizing the computation algebraically. Even though GCC recognized a+b as a subexpression to be extracted, it was not able to do the same for a+c. As a result, GCC is not able to consider X1 and X2 as mathematically equivalent. Therefore, as shown in the assembly code (lines 35 to 40), the program uses four multiplications to realise a computation whose result is always zero. Other optimization techniques, such as tree height reduction, peephole optimization [TvSS82], pipelining and others [Mas87] can be performed, but such methods are often technology dependent and are not as generic as the ones described above. After this brief review of the most commonly used compiler-level optimization techniques, the next sections present methods which rely on knowledge of the application domain.


1.2 Domain/Target Specific Optimizations

Conceptually, compilers are an ideal solution to performance tuning, since the source code does not need to be rewritten if the target is modified. However, compilers often generate suboptimal code, even for simple problems, as shown earlier. This is the main reason why most developers of specialized applications use high performance libraries such as FFTW [FJ05], BOOST [Kar05], etc. As an example, for a matrix multiplication, the code generated by compilers without such libraries is several times slower than the best hand-written code [YLR+03]. We now present the best-known generators and hardware/software compilers that address such problems.

1. GAUT
GAUT [UBS] is an academic High-Level Synthesis tool dedicated to DSP applications, developed at the Lab-STICC laboratory of UBS. Starting from a functional description of the design written in C, GAUT creates a CDFG to extract potential parallelism in the design. This is followed by traditional architectural synthesis steps including allocation, scheduling and binding. GAUT requires the specification of the following parameters: the cadence (throughput), the clock period and the target technology (FPGA families, etc.). Optional design constraints are I/O timing and memory mapping. GAUT synthesizes a potentially pipelined architecture composed of a processing unit, a memory unit, and a communication and multiplexing unit. GAUT generates an IEEE P1076 compliant RTL-level VHDL file. This file provides an input for commercial logic synthesis tools such as ISE/Foundation from Xilinx, Quartus from Altera or Design Compiler from Synopsys. As a PhD student in the Lab-STICC laboratory, I had full access to GAUT and was able to obtain all the information needed to run successful experiments. For this reason, the experiments performed with GAUT and described in this thesis are documented in more detail.

GAUT uses the GCC compiler as a front end to generate an internal CDFG representation of the design. However, it does not perform any significant optimization of the initial specification at this level. To illustrate this point, consider the Discrete Cosine Transform (DCT), used frequently in multimedia applications. The DCT is an important element of compression algorithms, notably in the JPEG format. The DCT of type 2 is defined by:

Y(j) = Σ_{k=0}^{N−1} x_k · cos[ (π/N) · j · (k + 1/2) ],   j = 0, 1, 2, ..., N−1     (1.1)

The computation of DCT2 can be represented in matrix form as y = M · x, where y and x are the output and input vectors, and M is the transform matrix composed of the cosine terms (coefficients) given in Equation 1.2.

    M = | cos(0)     cos(0)     cos(0)     cos(0)    |
        | cos(π/8)   cos(3π/8)  cos(5π/8)  cos(7π/8) |     (1.2)
        | cos(π/4)   cos(3π/4)  cos(5π/4)  cos(7π/4) |
        | cos(3π/8)  cos(7π/8)  cos(π/8)   cos(5π/8) |

The following code in Listing 1.1 is a direct translation into C of the DCT2 of size 4 (DCT2-4), defined as a matrix-vector product.

    #define N 4
    typedef int matrice[N*N];
    typedef int vecteur[N];

    static matrice coeff = {10000, 10000, 10000, 10000,
                             9238,  3826, -3826, -9238,
                             7071, -7071, -7071,  7071,
                             3826, -9238,  9238, -3826};

    int main(const vecteur X, vecteur Y)
    {
        int tmp;
        int i, j;
        for (i = 0; i < N; i++) {
            tmp = 0;
            for (j = 0; j < N; j++) {
                tmp += coeff[i*N + j] * X[j];
            }
            Y[i] = tmp;
        }
        return 0;
    }

2.4.6 Complexity of TED in terms of the number of operations

The additive cost φ_ADD is given by the number of additive (0-order) edges in the TED, while the multiplicative cost is defined by:

φ_MPY = Σ_{i=1}^{k} ( i · |E_i| + |Ẽ_i| )

such that:

E_i = {e ∈ E : i = ord(e) > 0},   Ẽ_i = {e ∈ E : i = ord(e) > 0, value(e) ≠ 1}

Such defined costs represent the upper bounds on the number of additions and multiplications. The final cost function, reflecting the total cost of the operations encoded by a TED, is obtained by weighting these two cost functions according to their relative cost in the final hardware implementation. If, for a given technology, the hardware cost of an adder in terms of silicon area is, for example, a quarter of that of a multiplier, the global cost function becomes:

φ_global = φ_ADD / 4 + φ_MPY

It should be noted that the practical DSP applications addressed in this thesis can be represented as linear TEDs (encoding linear multi-variate polynomials), where each variable appears with power k = 1. Therefore, for these TEDs the computation of the multiplicative cost reduces to:

φ_MPY = |E_1| + |Ẽ_1|

such that

E_1 = {e ∈ E : ord(e) = 1},   Ẽ_1 = {e ∈ E : ord(e) = 1, value(e) ≠ 1}
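To make the bookkeeping concrete, the following C sketch computes these cost functions for a linear TED stored as a flat edge list. The edge representation (a hypothetical struct ted_edge with ord and value fields) and the example edge set are illustrative assumptions, not the data structures of an actual TED package; the additive cost is taken here as the number of 0-order edges, in line with the edge counts discussed in Section 2.5.3.

    #include <stdio.h>

    /* Hypothetical flat representation of a linear TED edge:        */
    /* ord   - order of the edge (0 = additive, 1 = multiplicative)  */
    /* value - constant weight carried by the edge                   */
    struct ted_edge { int ord; int value; };

    /* phi_ADD: number of additive (0-order) edges. */
    static int phi_add(const struct ted_edge *e, int n) {
        int cost = 0;
        for (int i = 0; i < n; i++)
            if (e[i].ord == 0) cost++;
        return cost;
    }

    /* phi_MPY for a linear TED: |E1| + |E~1|, i.e. one count per    */
    /* 1-order edge plus one per 1-order edge whose constant != 1.   */
    static int phi_mpy(const struct ted_edge *e, int n) {
        int cost = 0;
        for (int i = 0; i < n; i++)
            if (e[i].ord == 1) {
                cost++;                        /* e in E1  */
                if (e[i].value != 1) cost++;   /* e in E~1 */
            }
        return cost;
    }

    int main(void) {
        /* Edges of the TED of F = (a + b)*c: one additive edge       */
        /* (a->b) and three multiplicative edges (a->c, b->c, c->1),  */
        /* all with constant 1. phi_MPY = 3 is only an upper bound:   */
        /* the normal factored form needs a single multiplication.    */
        struct ted_edge ted[] = { {0, 1}, {1, 1}, {1, 1}, {1, 1} };
        int n = 4;
        double phi_global = phi_add(ted, n) / 4.0 + phi_mpy(ted, n);
        printf("phi_ADD=%d phi_MPY=%d phi_global=%.2f\n",
               phi_add(ted, n), phi_mpy(ted, n), phi_global);
        return 0;
    }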

The following Figure 2.21 shows the correlation between the number of TED nodes and edges for a DCT2-4, for all (7!) possible variable orders. For the purpose of this experiment, the hardware cost of a multiplicative edge has been set equal to four times that of an additive edge.

Figure 2.21 – Repartition of the global cost function and the number of nodes for a DCT2-4 for all variable orders.


Again, it is important to emphasize that φ_global is a reachable upper bound on the complexity of the TED in terms of the number of operations. While it reflects the complexity of the hardware needed to implement all the operations, it does not necessarily represent the actual cost of the final hardware implementation after additional optimizations, operation scheduling, resource allocation, synthesis, etc. Nevertheless, φ_global gives a reachable upper bound and is easy to compute, and as such is a reasonable metric for the purpose of TED-based synthesis. Finally, if the synthesis tool supports the use of shifters for constant multiplication, the multiplicative cost can be decomposed into a multiplicative cost for variables, φ_MPY = Σ_{i=1}^{k} i · |E_i|, and a multiplicative cost for constants, φ_CONST = Σ_{i=1}^{k} |Ẽ_i|, with E_i and Ẽ_i as defined previously. Nevertheless, φ_CONST only gives the number of multiplications by constants encoded in the TED; from this value it is not possible to directly obtain the number of shifters needed in the final architecture.
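The reason φ_CONST does not translate directly into a shifter count is that each constant decomposes into a different number of shift-and-add terms. As a rough illustration (a generic sketch, not tied to GAUT or any particular synthesis tool), a multiplication by the constant 10 can be realised with two shifts and one addition:

    #include <stdio.h>

    /* Multiply x by the constant 10 without a hardware multiplier:  */
    /* 10 = 8 + 2, so x*10 = (x << 3) + (x << 1).                     */
    static int times10(int x) {
        return (x << 3) + (x << 1);
    }

    int main(void) {
        printf("%d\n", times10(7));   /* prints 70 */
        return 0;
    }

A constant with more non-zero bits (or a signed-digit recoding with fewer terms) changes the shifter and adder count, which is why the number of shifters cannot be read off φ_CONST alone.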

2.4.7 Normal Factored Form

Ren [Ren08] introduced the important concept of the Normal Factored Form (NFF) as a unique representation of an expression encoded by a TED. It is based on the following observation. Given a TED (with a fixed variable order), several factored forms can be derived that match the structure of the TED. These factored forms, while functionally equivalent, may differ in the ordering of terms and the ordering of variables inside the terms. By fixing the ordering of variables and operations to be compatible with that of the TED, one such form can be chosen as the representative expression for the TED. Such an expression is called the Normal Factored Form.

Definition 2.4 A factored form of an expression encoded by a TED is called the Normal Factored Form for that TED if the ordering of variables in such a form is compatible with that of the TED.

As an example, consider the TED in Figure 2.22. The following factored forms can be derived from this TED: (a + b)c, (b + a)c, c(a + b) or c(b + a). In addition to being functionally equivalent (a feature of the canonical TED representation), they are also structurally equivalent in the sense that they have the same operators involving the same operands. That is, there is exactly one addition, involving variables a and b, and one multiplication, involving the result of the addition (a + b) and variable c. In other words, the DFGs constructed from all these factored forms will be identical up to the ordering of operands for each operation (e.g., a + b vs b + a). In contrast, the expression ac + bc is not a factored form for this TED, since it involves two multiplications, ac and bc. It has been shown that such a form is unique for a TED with a fixed variable order and that it can be obtained using any of the TED decomposition procedures described in [Ren08].


Figure 2.22 – TED and its Normal Factored Form F′ = (a + b) · c.

2.5 DFG Generation: Transformation of Function into Structure

The turning point in TED-based synthesis and optimization was the realization that the TED, which serves as a functional representation of a computation, can be used to directly generate a structural representation in the form of a data flow graph (DFG), suitable for the final hardware representation. It has been shown by Askar et al. [Ask06][CAGP+07] that this functional-to-structural transformation can be obtained by performing a functional decomposition of the TED graph. This section reviews the major principles of this method.

2.5.1 Cut-based TED decomposition

The TED decomposition proposed in [Ask06, CAGP+07] is a procedure which iteratively partitions the original TED into subexpressions by applying a series of cuts. The resulting subexpressions are subsequently partitioned into smaller TEDs by applying new cuts, until trivial expressions (constants) are reached. Each time a cut is applied, a hardware operator, add or mult, is introduced to perform the required operation (addition or multiplication) on the two subexpressions. In this way, the functional TED representation, composed of algebraic operations, is transformed into a structural representation, with hardware operators. The structural representation obtained as a result of such a decomposition is a familiar data flow graph (DFG), whose nodes represent hardware operators, and whose edges represent the processed data (operands and computed results). Since polynomials can be represented as a combination of additions and multiplications, two types of cuts are introduced: additive and multiplicative, defining disjunctive and conjunctive decomposition, respectively. In the following, it is assumed that the TED is linear, i.e., each TED node has only two children, corresponding to the 0-order (additive) and first-order multiplicative edges. This restriction is particularly important for conjunctive decomposition, which is not well defined for general, nonlinear TEDs.
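For readers who prefer a concrete picture, a linear TED node can be captured by a small C structure with exactly the two children just described, and evaluated by the fundamental Taylor expansion F(v) = F0 + v · F1 used throughout this chapter. This is an illustrative sketch only; the type and field names are assumptions, edge weights are taken as 1, and this is not the data structure of any TED package described in [Ask06] or [Ren08]:

    #include <stdio.h>

    /* A node of a linear TED: one additive (0-order) child and one   */
    /* multiplicative (1st-order) child. NULL means the corresponding */
    /* term is absent; the terminal node "1" is a sentinel.           */
    struct ted_node {
        int value;                /* numeric value of the variable      */
        struct ted_node *child0;  /* 0-order child: additive term F0    */
        struct ted_node *child1;  /* 1st-order child: term F1, to be    */
                                  /* multiplied by this node's variable */
    };

    static struct ted_node one = {1, NULL, NULL};  /* terminal node 1 */

    /* Fundamental expansion at node v: F(v) = F(child0) + v*F(child1). */
    static int eval(struct ted_node *v) {
        if (v == &one) return 1;
        int f0 = v->child0 ? eval(v->child0) : 0;
        int f1 = v->child1 ? eval(v->child1) : 0;
        return f0 + v->value * f1;
    }

    int main(void) {
        /* TED of f = (a + b) * c, as in Figure 2.24, with a=2, b=3, c=4. */
        struct ted_node c = {4, NULL, &one};
        struct ted_node b = {3, NULL, &c};   /* b's 1-edge points to c    */
        struct ted_node a = {2, &b, &c};     /* 0-edge to b, 1-edge to c  */
        printf("f = %d\n", eval(&a));        /* (2 + 3) * 4 = 20          */
        return 0;
    }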

2.5.1.1 Additive and Multiplicative Cuts

The disjunctive decomposition introduced by Askar [Ask06] relies on a fundamental property of a TED, which states that the function evaluated at a given node is computed as the sum of the functions of its children. An additive cut at a node splits the node into two, which decomposes the expression rooted at this node into two terms. One term corresponds to the 0-child (along the additive edge) and the other to the 1-child (pointed to by a 1-edge) multiplied by the variable corresponding to the split node. The two terms are combined in the structural DFG representation with an add operator. (One can argue that the disjunctive decomposition should be associated with an edge rather than a node; see Section 2.5.2.) An example of additive cuts is given in Figure 2.23(a). Cut A3 partitions the root function P = (G + H) + F · (I + J) into (G + H) and F · (I + J). Similarly, A1 partitions the function (G + H) rooted at node G into two trivial expressions, G and H, etc.


Figure 2.23 – Admissible (A1, A3) and inadmissible (M1, A2) cuts at the first stage of TED decomposition: a) TED with a set of candidate cuts; b) TED decomposition after applying cut M1; c) TED decomposition after applying cut A2.

Conjunctive decomposition is based on the notion of a dominator. A dominator in the TED graph is a node which belongs to every path from the TED root to the terminal node 1. A multiplicative cut at such a node separates the TED conjunctively into two subexpressions: one corresponding to the subgraph above the dominator (with the dominator node replaced by 1), and the other corresponding to the subgraph rooted at the dominator. The two subexpressions are combined in the resulting DFG by a mult operator.

Figure 2.24 – TED of the function f = (a + b)c with variable order a, b, c.

An example of a dominator is shown in Figure 2.24 for the function f = (a + b) · c. This TED has a single dominator, c, since all paths from the root to the terminal node 1 pass through this node. By cutting this TED at the dominator c, the function can be decomposed conjunctively as the product of (a + b) and c. (Note: in his original work, Askar [Ask06] actually associated multiplication with an edge of a hierarchical graph and provided an additional extraction procedure to group additive terms in a single subgraph.) Cut M1 in Figure 2.23(a) is an example of a multiplicative cut. It decomposes the right subgraph of the TED (representing Q = F · (I + J), obtained after applying cut A3) into two expressions to be multiplied together: F and (I + J).
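Since a dominator is defined purely in terms of root-to-terminal paths, it can be identified with a simple reachability test: node d is a dominator exactly when the terminal becomes unreachable from the root once d is removed. The sketch below (a hypothetical adjacency-matrix encoding, not code from the thesis) checks this for the TED of f = (a + b) · c:

    #include <stdio.h>

    #define N 5   /* nodes 0..N-1; node 0 = root, node N-1 = terminal 1 */

    /* adj[u][v] != 0 iff the TED has an edge (of either kind) u -> v. */
    static int reaches(int adj[N][N], int u, int target,
                       int skip, int visited[N]) {
        if (u == skip || visited[u]) return 0;
        if (u == target) return 1;
        visited[u] = 1;
        for (int v = 0; v < N; v++)
            if (adj[u][v] && reaches(adj, v, target, skip, visited))
                return 1;
        return 0;
    }

    /* d is a dominator iff removing it disconnects root from terminal. */
    static int is_dominator(int adj[N][N], int d) {
        int visited[N] = {0};
        return !reaches(adj, 0, N - 1, d, visited);
    }

    int main(void) {
        /* TED of f = (a + b)*c: node 0 = a (root), 1 = b, 2 = c,      */
        /* 4 = terminal "1"; node 3 is unused padding.                 */
        int adj[N][N] = {0};
        adj[0][1] = 1;   /* a -> b, additive        */
        adj[0][2] = 1;   /* a -> c, multiplicative  */
        adj[1][2] = 1;   /* b -> c, multiplicative  */
        adj[2][4] = 1;   /* c -> terminal node 1    */
        printf("c is %sa dominator\n", is_dominator(adj, 2) ? "" : "not ");
        printf("b is %sa dominator\n", is_dominator(adj, 1) ? "" : "not ");
        return 0;
    }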

2.5.1.2 Admissible Cut Sequences

The first step of the cut-based decomposition procedure is to identify a set of additive and multiplicative cuts in the TED. Since one of the goals of architectural synthesis is to minimize the number of hardware operators needed to implement the function, only those cuts that do not increase the number of operations during the cut-based decomposition are allowed in the procedure. At this point, the exact number of operations needed to implement the function is fixed and determined by the number of cuts of the corresponding type. For example, the TED in Figure 2.23(a) requires three adders, represented by the additive cuts A1, A2, A3, and one multiplier, represented by the multiplicative cut M1. Nevertheless, while the number of operations for a given TED is fixed by the number of cuts, the structural arrangement of the operators in the resulting DFG depends on the sequence of cuts applied during the TED decomposition. Finding such a sequence of cuts is the subject of the second step of the decomposition. In particular, one seeks a sequence of cuts that minimizes some objective function, such as the design latency.

To address this issue, Askar introduced the notion of admissible cuts. An additive (multiplicative) cut is called admissible if it partitions the expression encoded in the TED into exactly two subexpressions in an additive (multiplicative) way, and does not increase the number of operations determined by the original TED representation. Otherwise the cut is called inadmissible. Cut M1 in Figure 2.23 is an example of an inadmissible cut for this TED at this stage of the decomposition. The application of this cut to the original expression divides the original TED into three subexpressions: G + H, F, and (I + J), shown in Figure 2.23(b). It can be shown that a multiplicative cut applied at node i is admissible if and only if node i is a dominator. Note that node I in Figure 2.23 is not a dominator, since there are paths from the root to the terminal node 1 that do not pass through this node. Similarly, cut A2 in the figure is inadmissible because it increases the number of operators from 4 to 6, as shown in Figure 2.23(c). A necessary condition for an additive cut to be admissible is that the corresponding node be either the root of the TED or have only additive incoming edges. Cuts A1 and A3 satisfy this condition and are admissible.

It has been shown that admissibility is a dynamic property: a cut can be inadmissible for a TED(k) during decomposition step k, and become admissible for one of the subgraphs of TED(j) at some step j > k. Finding a sequence of cuts such that each cut is admissible in the corresponding decomposition step is one of the main optimization tasks addressed in the work of Askar. In our example in Figure 2.23, cut M1 is not admissible as the first cut. However, it becomes admissible after applying cut A3. Subsequent application of M1 makes cut A2 admissible, etc. Figure 2.25 shows the resulting subexpression TEDs for the sequence (A3, M1, A2, A1).

In the work of Askar, the admissibility of cuts is captured in a Precedence Graph (PG), which encodes the precedence relation between the cuts during the TED decomposition process. The nodes represent cuts, and a directed edge Ci → Cj between two nodes indicates that cut Ci must be applied before cut Cj in order for the cuts to be admissible. Satisfaction of the precedence relation is a necessary condition for a valid decomposition that leads to a DFG structure with two-operand operators, add and mult. By construction, the Precedence Graph is a directed acyclic graph (DAG). A topological ordering of the graph nodes gives a required sequence of decomposition cuts. Several orderings may exist which satisfy the precedence relation, each representing a different structure and hence resulting in a different data flow graph (DFG).
Figure 2.26(a) shows the DFG corresponding to the cut sequence (A3, A1, M 1, A2), applied to the TED example in Figure 2.25. Incidentally, other sequences, namely

DFG Generation: Transformation of Function into Structure

57

A3 F

A1 G

A2

H

M1

A3

I J 1

+

G

F

H

A1

A2 M1

I

M1

J 1 1

+

*

A2 I J

F

A2

1

G H

+ I

1

1

J 1

1 1

Figure 2.25  An admissible sequence (A3, M 1, A2, A1) and the resulting TED decomposition.

(A3, M 1, A1, A2), and (A3, M 1, A2, A1), result in the same data ow graph. The issue of nding classes of equivalent sequences that can map onto the same DFG is discussed in [Ask06]. Figure 2.26(b) shows the DFG obtained with the cut sequence (A1, A3, M 1, A2). It can be seen that the two DFGs have dierent characteristics in terms of the design latency, tree balance, etc. In summary, the transformation of the TED into an optimum DFG structure consists of nding an optimum sequence of cuts which satises the precedence constraint and optimizes some objective (in this case latency of the resulting DFG). A branch and bound algorithm has been used in [Ask06] for this purpose.

58

Taylor Expansion Diagram A3

A1

A3

M1

A1

*

G A2

G

H

F I

a)

J

*

H

M1 A2

F

b)

I

J

Figure 2.26  Data ow graphs (DFG) obtained from TED of SG2 by applying different cut sequences: a) DFG for sequences (A3, A1, M 1, A2), (A3, M 1, A1, A2), or (A3, M 1, A2, A1); b) DFG for the sequence (A1, A3, M 1, A2).

2.5.1.3 Limitations of Cut-based Decomposition The cut-based decomposition described in this section applies to TEDs for which at every decomposition step there exists at least one admissible cut, i.e., which have at least one admissible cut sequence. We refer to this as simple decomposition property (see Section 2.5.2). However, there are TEDs, such as the one shown in Figure 2.5.1.3 which do not have such a property. Indeed, the TED for function F0 = (a + b) · (c + d) + d has no dominator and hence no multiplicative cut. Neither does it have an additive cut that would permit to disjunctively partition the graph into two terms (nodes a, b, c cannot be split additively). To address this issue, hierarchical approach has been proposed in [Ask06]. This has been done by extracting maximal subexpressions and representing them as nodes in a top-level (global) TED, until a simple decomposition is possible. Each of the subgraphs is then subjected to iterative and hierarchical decomposition. For the TED in Figure 2.5.1.3 the extracted subexpressions are F1 = (a + b), F2 = (c + d) and F3 = d, and the resulting top-level TED is F0 = F1 · F2 + F3 . Such constructed TED has an additive cut A1 at node F1 and a multiplicative cut M1 at node F2 , leading to an admissible sequence, A1 , M1 . The two nontrivial subgraphs F1 , F2 have simple decomposition each, resulting in a nal decomposition. However, the extraction algorithms proposed by Askar do not cover all cases and are not robust, limiting the application of his cut-based method to simple decompositions. The hierarchical approach to TED decomposition is illustrated with a simple example in Figure 2.27. It has been believed that, for linear TEDs, the number of multipliers is dened by the number of 1st-order (multiplicative) edges [Ask06]. This, however, is not true [Ren08]; the number of multipliers depends not only on the number of the multiplicative edges but also on the structure of the TED. This topic is discussed in Section 2.5.2. It has be shown that the number of adders is determined by the number of 0order (additive) edges [Ren08]. Nevertheless, this is not true for multiple output TEDs.

DFG Generation: Transformation of Function into Structure

59

A B C

SG1 D E

F

A

F

SG2

G

B

H

G C

H

K

I

D

J

I E

J

K 1

1

1

1

Figure 2.27  Generation of hierarchical TED: a) original TED for f = (A + B + C + D + E)(G + H + F (I + J)) + K ; b) hierarchical TED obtained by extraction; c) TED of subgraph SG1; d) TED of subgraph SG2. Indeed, consider the TEDs shown Figure 2.28 corresponding to F 1 = a + b + c + d and F 2 = a − b + c + d, the number of 0-order edges is equal to 5, whereas it's possible to recognize that the out-going additive edges of node a and b used the computation of F 2 are negative. It results that the term a can be merged with what is pointed by node b of F 2 (ie. c + d) with a pure additive operator (no subtraction). Therefore a + c + d can be shared for the computation of F 1 and F 2 resulting in S1 = a + c + d, F 1 = S1 + b and F 2 = S1 − b. Finally, we should point out that the cut-based decomposition, described in this section, is applicable only to linear TEDs, representing linear multi-variate polynomials. Such TEDs have only additive and rst order multiplicative edges, and each TED node has at most two children. This limitation has been recently removed in the work of Ren [Ren08], who generalized the application of TED decomposition to arbitrary TEDs by performing TED linearization, as discussed in Section 2.6.1. This method is described next.

2.5.2 Complex TED Decomposition This section reviews basic concepts of TED-based factorization and decomposition of polynomial expressions developed by Ren [Ren08]. The principle goal of these algorithms is to minimize the number of variables (which correspond to operands in hardware representation) and operators (adders and multipliers). An example of factorization is the transformation of expression F = ab+ ac into F = a(b + c), which reduces the number of variables from four to three, and the number of multiplications from two to one. If a particular sub-expression appears more than once in a given expression, it can be eliminated from the expression and replaced by a

60

Taylor Expansion Diagram

Figure 2.28  TEDs of the functions F 1 = a + b + c + d and F 2 = a − b + c + d new variable. This process is known as common subexpression elimination, or CSE , and applies to single and multiple expressions. The simplied expressions are represented as a series of subexpressions and hence is referred to as a decomposition. Decomposition plays particularly important role in simplication of multiple expressions (or multipleoutput functions), since the same sub-expression can be extracted simultaneously from several expressions, which can thus be represented with fewer variables and operations. Factorization and decomposition operations are performed directly on the TED graph, taking advantage of its compact, canonical representation. Since TED-based factorization and decomposition procedures rely on the same principle of nding common subexpressions to be factored out or extracted, they are referred to jointly as TED decomposition. Such a decomposition is achieved by extracting TED subgraphs and replacing them by new expressions, each represented as a TED. The method is based on conjunctive and disjunctive decomposition, similar to that of cut-based decomposition of Askar [Ask06]. Both methods perform conjunctive decomposition based on dominator nodes that dene multiplicative cuts. Also, both of them use subgraph extraction and hierarchical approach during the decomposition, although at dierent stages of decomposition. However, there are some subtle dierences between these two approaches in the way conjunctive decomposition and extraction is performed. Specically, in [Ren08] the disjunctive decomposition is based on identifying additive edges, called split edges, whose removal decomposes the TED into two disjoint graphs. Furthermore, a robust extraction algorithm has been developed to handle classes of TEDs that do not have simple decomposition, i..e., have no dominator and no split edge (or, equivalently, have

DFG Generation: Transformation of Function into Structure

61

no admissible cut sequence). Such a decomposition is referred to as complex decomposition. It has been shown that if the graph does not have simple decomposition, it must be decomposed disjunctively into two non-disjoint subgraphs. This can be illustrated with an example shown in Figure 2.29 for function F = x · (z · u + q · r) + (p · w + y) · r. This function can be represented as a disjunction of two functions F1 + F2 , with F1 = x · (z · u + q · r) and F2 = (p · w + y) · r, which share a common subgraph rooted at node r. The TED does not have a single split-edge that would separate the graph into two disjoint subgraphs; neither does it have a dominator that would allow it to decompose it disjunctively into disjoint graphs (note that r is not a dominator in this graph). Hence the graph does not have a simple decomposition, but can be decomposed disjunctively.

(a)

(b)

(c)

Figure 2.29  Complex TED decomposition for F = x · (z · u + q · r) + (p · w + y) · r: (a) Original TED; (b) Simplied TED after product term substitutions, P1 = z · u and P2 = p · w; (c) Simplied TED after sum term substitution, S1 = P2 + y . Such a non-disjoint decomposition is accomplished in a systematic way on a TED as follows. The rst step is to replace a series of nodes connected by multiplicative edges only by a single variable PI (such set of edges corresponds to a product of variables Πvi ).

62

Taylor Expansion Diagram

This is followed by replacing each maximal set of nodes connected by additive edges and sharing a common node, by a single variable SK (this appear in the expression as sum P of variables vk ). Several iterations of this procedure may be necessary to reduce the TED to the simplest form, where all the product terms and sums terms are replaced by new variables. Such a transformed TED is then subjected to the nal decomposition based on the fundamental Taylor expansion principle. This decomposition procedure is illustrated with the following example for F = x · (z · u + q · r) + (p · w + y) · r in Figure 2.29(a). It is composed of the following three steps. The rst two steps serve as subgraph extraction to produce a TED used for the third step which is the nal decomposition. 1. Product Term Substitution. The following subsets of TED nodes, connected in series by multiplicative edges only, are identied and replaced by new variables: P1 = z · u and P2 = p · w. Nodes u and w, which are dominators in these subgraphs, are referred to as local dominators. The transformed TED is shown in Figure 2.29(b). Note that the product q · r is not replaced by a single variable since the nodes (q, r) do not form a series product term as dened above (node r has more than two edges). 2. Sum Term Substitution In this step the sum terms are identied in the TED and substituted by new variables. A sum term appears in the TED graph as a set of variables incident to the edges that have a common node and are linked together by one or more additive edges. Such patterns can be readily identied by traversing the graph in a bottom-up fashion and creating, for each node v , a list of nodes reachable from v by a multiplicative edge. The procedure starts at the terminal node 1 and traverses all the nodes in the graph bottom-up, in a reverse variable order. The set of nodes reachable from terminal node 1 is {P1 , r}. Since these nodes are not linked by an additive edge, they do not form a sum term in the expression. The next node to be examined is r. The list of nodes reachable from node r is {q, y, P2 }, of which {P2 , y} are linked by an additive edge. Hence they correspond to a sum-term (P2 + y), associated with Figure 2.29(c) the product (P2 + y) · r. No other nodes in this graph have this property and no further simplication is possible. The resulting graph is shown in Figure 2.29(c), with the sum term (P2 + y) substituted by variable S1 . 3. Final Decomposition. Note that the resulting TED in Figure 2.29(c) does not conform to a simple decomposition. We refer to such TED as an irreducible TED. At this point a nal decomposition is performed by a straightforward application of the fundamental Taylor series expansion procedure. The graph is traversed in a topological order, starting at the root node. At each visited node v the expression F (v) is computed as F (v) = F0 + v · F1 , where F0 is the function rooted at the rst node reached

DFG Generation: Transformation of Function into Structure

63

from v by an additive edge, and F1 is the function rooted at the rst node reached from v by a multiplicative edge. Using this procedure, the following expressions are derived for the TED in Figure 2.29(c).

F = F (x) = F (S1 ) + x · F (P1 )

(2.9)

F (S1 ) = S1 · F (r) F (r) = r F (P1 ) = P1 + F (q) F (q) = q · r where, P1 = z · u, P2 = p · w and S1 = (P2 + y), as obtained by previous steps. Note that such a decomposition has a total of 5 multiplications (compare this to the 6 multiplicative edges and 4 local dominators). On a nal note, Ren developed an ecient TED linearization procedure that allows it to decompose nonlinear algebraic expressions. This procedure is discussed in Section 2.6.1. We conclude this section with an important result regarding the number of operations in the TED.

We conclude this section with an important result regarding the number of operations in the TED.

2.5.3 Counting the Number of Operations in TED

One way to count the number of operations in the TED is to derive the Normal Factored Form, described in Section 2.4.7, and simply count the number of operations involved. This approach requires performing a full TED decomposition, as described in the previous section. An alternative method is to analyze the TED graph without performing the decomposition. The method is based on analyzing the nodes with multiple incoming multiplicative edges to identify local dominators. The following observations are made for a TED of a single expression in [Ren08] regarding the number of operations that can be derived by such an analysis:

• The number of additions in the Normal Factored Form derived from a TED is fixed and equal to the number of additive edges in the TED. Hence, the number of additive edges in the TED directly provides the number of additions.

• An upper bound on the number of multiplications in the expression encoded by the TED is equal to the number of non-trivial multiplicative edges in the TED. This is intuitively clear, since each multiplicative edge represents a multiplication (in some, not necessarily minimal or normal, form).

• The number of multiplications associated with a dominator node is equal to one. This observation follows from the definition of the dominator: all the multiplicative edges incident to the dominator node count as one multiplication.


• The lower bound on the number of multiplications in an expression encoded by a given TED is equal to the number of internal nodes with one or more incoming multiplicative edges. If each such node is a dominator, this lower bound is achievable.

• Finally, the number of multiplications in a linear TED is equal to the number of multiplicative edges in all the irreducible TED graphs.

To illustrate these points, refer again to the TED in Figure 2.29 for F = x · (z · u + q · r) + (p · w + y) · r. This expression has five multiplications: two of them are associated with local dominators u and w of subexpressions P1 = z · u and P2 = p · w, respectively; two multiplications correspond to the two multiplicative edges incoming into node r; and one multiplication corresponds to the multiplicative edge (x, P1) in the final TED shown in Figure 2.29(c).
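These counting rules are easy to mechanize. Below is a small C++ sketch on a hypothetical edge-list encoding of a TED (node 0 stands for the terminal node 1, and edges into it are treated as trivial; real TED edges also carry weights), computing the addition count and the two multiplication bounds:

```cpp
#include <iostream>
#include <map>
#include <vector>

enum class EdgeKind { Additive, Multiplicative };
struct Edge { int from, to; EdgeKind kind; };  // edges point from parent to child

int main() {
    // Hypothetical TED edge list; node 0 is the terminal node 1.
    std::vector<Edge> ted = {
        {3, 2, EdgeKind::Additive},        // additive edge: one addition
        {3, 1, EdgeKind::Multiplicative},  // non-trivial multiplicative edge
        {2, 1, EdgeKind::Multiplicative},  // non-trivial multiplicative edge
        {1, 0, EdgeKind::Multiplicative}}; // trivial edge into terminal 1

    int additions = 0;
    std::map<int, int> multIn;  // incoming multiplicative edges per internal node
    for (const Edge& e : ted) {
        if (e.kind == EdgeKind::Additive) ++additions;
        else if (e.to != 0) ++multIn[e.to];  // skip trivial edges to terminal 1
    }
    int upper = 0, lower = 0;
    for (const auto& [node, count] : multIn) { upper += count; ++lower; }
    std::cout << additions << " additions, between " << lower << " and "
              << upper << " multiplications\n";  // 1 addition, 1..2 mults
}
```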

2.6 Other TED-based Representation

This section describes several extensions proposed to the TED representation to enhance its application to other domains, such as simulation and verification.

1. Conditional Taylor Expansion Diagrams (CTED): The method proposed by Gharehbaghi et al. [GHE04] extends the TED representation by adding conditional and relational nodes, such as "<" and "!=", to the TED formalism. The idea behind CTED is that conditionals that evaluate to True or False can be represented as Boolean variables. It is known that TED, in addition to representing algebraic functions over symbolic variables, can also represent logic operations over Boolean variables [CKA06]. However, TED cannot efficiently represent relational operators without decomposing the operands into bit vectors. By adding the ability to handle relational operators, CTED is able to handle conditions as well as functions. In this approach, no bit expansion is needed to handle conditional relations, which saves memory and runtime during simulation and verification.

2. Binary Taylor Diagrams (BTED): This modification aims to simplify the TED representation by converting it to a binary structure. Recall that in the original TED representation the number of children at a given node is equal to the degree of the corresponding variable, which can vary between the nodes. This, in turn, complicates the implementation of the internal TED data structure. The representation proposed by [HSA+05] takes advantage of several algorithms [BRB90] [PSC94] [SB96] to transform the TED into a binary structure, similar to BDDs, where each node contains exactly two children. Recall the Taylor series of a differentiable function f(x) around x = 0, defined in Section 2.3.1 and expressed by Equation 2.10.


f(x) = f(0) + x f'(0) + (1/2) x² f''(0) + (1/3!) x³ f'''(0) + · · ·    (2.10)

By factoring out the variable x, Equation 2.10 can be rewritten in the following form (known as the Horner form):

f(x) = f(0) + x(f'(0) + x((1/2) f''(0) + x((1/3!) f'''(0) + · · ·)))    (2.11)
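For instance, taking f(x) = 1 + 2x + 3x², the Horner form is 1 + x(2 + 3x): at x = 2, both the expanded form 1 + 2 · 2 + 3 · 4 = 17 and the nested form 1 + 2 · (2 + 3 · 2) = 17 agree, but the nested form needs only two multiplications instead of three.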

BTD applies Equation 2.11 to the root of the function, resulting in a decomposition into two expressions: f(0) as the left child, and (f(x) − f(0))/x as the right child. This decomposition is then iteratively applied to the function (f(x) − f(0))/x, resulting in the BTD diagram shown in Figure 2.30. As shown in the figure, each BTD node has exactly two children, which simplifies the implementation of the internal data structure.

Figure 2.30 – Decomposition principle in BTD

Note that this representation is similar to the one described in Section 2.5.2, which describes TED linearization. It should be pointed out that the name BTD does not imply that the variables are binary, but rather refers to the binary nature of the structure.

3. Linear Taylor Expansion Diagram (LTED): A version of TED has been proposed by Alizadeh et al. [AN04] with the aim of handling relational expressions. This is achieved by adding operators such as "E" (equal to zero), "NE" (not equal to zero), and "GE" (greater than or equal to zero) to the set of TED nodes. The resulting representation is called Linear TED, or LTED. In addition to the standard TED nodes representing variables and constants, this structure includes nodes of type Branch, Union and Intersect, whose functionalities are defined as follows:


• Branch: The Branch node is particularly suitable for multiplexer-based structures. It has three fields: Select, representing the binary select expression (evaluating to True or False), and two input fields, InOne and InZero, connected to regular LTED nodes. The following equation indicates the functionality of a Branch node.

F = Select & InOne + ¬Select & InZero    (2.12)

• Union and Intersect: The Union and Intersect nodes are added to the LTED structure to enable the respective set operations on algebraic expressions. Each of these nodes has two fields, for the two arguments, connected to the LTED nodes. Similarly to the add and mult operators of a TED, composition rules have been defined in LTED for the Union and Intersect operations. An example of an LTED node is given in Figure 2.31 for the following if statement:

1 If (a)
2 Then X [...]

[...] 1, are modified according to the value of ord(eout). Two cases have to be considered here:

– If ord(eout) is an even number, the node v is deleted and its incoming edges ein are connected to terminal node 1. This is because j^k, with k even, evaluates to ±1. Then, if ord(eout)/2 is an odd number, the weight of the incoming edges ein becomes equal to −value(ein) · value(eout). Similarly, if ord(eout)/2 is an even number, it evaluates to value(ein) · value(eout).

– If ord(eout) is an odd number, the node v is preserved. If (ord(eout) − 1)/2 is odd, then value(ein) is multiplied by −1. If (ord(eout) − 1)/2 is even, then value(ein) remains the same. Finally, ord(eout) is set to 1.


• Case 2: Now let us assume that the TED graph has been linearized, as described in Section 2.6.1. In this case the Complex Reduction consists of the following steps:

– The TED is first reordered in such a way that every node v such that var(v) = jn, with n being the index produced by the linearization algorithm, is placed at the bottom part of the TED and ordered as follows: jn, jn−1, · · ·, j1.

– Consider all the edges ein connected to the nodes v such that var(v) = jn, with n being an even number. If n/2 is an odd number, then value(ein) is multiplied by −1; otherwise value(ein) remains unchanged.

– Then the edges ein connected to the nodes v such that var(v) = jn, with n being an even number, are connected to terminal node 1.

– Finally, all the nodes v such that var(v) = jn, with n being an odd number, take j1 as their variable (i.e. var(v) is set to j1).

After performing the Complex Reduction, the reduction and normalization process described in Section 2.3.6 must be performed to ensure the canonicity of the TED. Then the reordering algorithm returns the variables to their initial positions. The Complex Reduction of a non-linearized TED and of a linearized TED is illustrated in Figures 3.7 and 3.8, respectively, for the following example: F' = (a + j · b) · ((c + j · d) · (e + j · f)).
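In both cases, the edge rewriting implements the elementary identity j^k ∈ {1, j, −1, −j}, determined by k mod 4 (since j² = −1). The following minimal C++ sketch (the struct and function names are hypothetical) shows this underlying rule:

```cpp
#include <iostream>

// Result of reducing j^k: a sign and whether a residual factor j remains.
struct ReducedPower { int sign; bool hasJ; };

ReducedPower reduceJPower(int k) {
    switch (((k % 4) + 4) % 4) {   // normalize so negative k works too
        case 0:  return {+1, false};  // j^0 = 1
        case 1:  return {+1, true};   // j^1 = j
        case 2:  return {-1, false};  // j^2 = -1
        default: return {-1, true};   // j^3 = -j
    }
}

int main() {
    for (int k = 0; k <= 5; ++k) {
        ReducedPower p = reduceJPower(k);
        std::cout << "j^" << k << " = " << (p.sign < 0 ? "-" : "")
                  << (p.hasJ ? "j" : "1") << "\n";
    }
}
```

For even k this matches Case 1 (the node disappears, the sign depends on the parity of k/2); for odd k it matches Case 2 (a single j remains, the sign depends on the parity of (k − 1)/2).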

Figure 3.7 – Complex Reduction of a non-linearized TED for F' = (a + j · b) · ((c + j · d) · (e + j · f))


Figure 3.8 – Complex Reduction of a linearized TED for F' = (a + j · b) · ((c + j · d) · (e + j · f))

3.3 Conclusion

In this chapter several modifications have been proposed to improve the TED representation. These changes are particularly interesting for TEDs that handle data-intensive applications, such as those used in the telecommunications and signal processing areas. Using symbolic variables makes the TED structure independent of the precision needed for the computations, which is an important feature of this contribution. This modification to the TED formalism is a key element of the patent [CABG] and is the starting point for the optimization methods developed in this thesis. A new node associated with the symbolic variable j has also been introduced to support complex operations with TED. The next chapter examines how to reduce the number of operations encoded in a TED by taking advantage of these modifications.

Chapter 4

Common Subexpression Elimination

In this chapter, we propose several methods to simplify and optimize linear and polynomial expressions using TED. Our objective is to rewrite the initial specification into a factorized structure that minimizes the number of multiplications and additions. The structure of this chapter follows the progression of our work to find and extract common sub-expressions automatically. After defining the concept of Common Sub-expression Elimination (CSE), we present a first CSE method based on a static parsing of the TED (Static CSE, or SCSE). To overcome the limitations of SCSE, we improve it by allowing a dynamic reordering of the TED variables (Dynamic CSE, or DCSE). Finally, we propose additional methods to further improve the DCSE process and we analyse the performance of the final algorithm on several DSP examples.

4.1 Common Subexpression Elimination (CSE)

This section describes an efficient method to perform common subexpression elimination (CSE) for DSP transforms based on the TED representation. As described previously in Chapter 1, this is the main bottleneck of the currently available tools. CSE is a well-known method, widely used in compilers to reduce the number of computations realized by the target application. Usually, it is performed during the syntactic and lexical analysis of the specification. It attempts to detect redundant sub-expressions in the data flow, stores them in a new intermediate variable, and reuses that variable wherever the subexpression appears in the specification. Therefore, each redundant term is computed only once. In this thesis we perform common subexpression elimination directly on a TED, as the basic routine to decrease the number of operations encoded in the TED.

Note: In this section, ambiguity may arise when referring to a node v with a variable var(v) = x that may be present in several locations in the graph. To resolve this ambiguity, the nodes are referred to with the following formalism: v(variable)position, where variable = var(v) and position is the position of the node, counting from the left, on the var(v) level. For example, v(X1)2 refers to the second node associated with variable X1, counting from the left.


4.1.1 Static Common Subexpression Elimination

In this subsection we give the definitions of a CSE candidate node and a trivial node before describing an optimization method called Static Common Subexpression Elimination (SCSE). Finally, the limitations of the algorithm are presented, which motivate the enhancements of the method proposed in the next subsections.

4.1.1.1 Definitions

Before reviewing the Static CSE algorithm on a TED, it is necessary to define several properties of particular nodes in a TED graph. The aim of CSE is to eliminate redundancy of computation by reusing terms that have already been computed. These terms, called CSE candidates, appear whenever a TED subgraph that encodes a computation is referenced more than once.

Definition 4.1 Trivial node: A trivial node vt in a TED is an internal node that points only to terminal node 1 with a multiplicative edge whose weight is equal to ±1. Such an edge represents a trivial multiplication by ±1 (i.e. the function rooted at this node is var(vt)).

Definition 4.2 CSE candidate node: A candidate node for Common Subexpression Elimination is a non-trivial node referenced more than once.

Definitions 4.1 and 4.2 are used in the SCSE Algorithm 2 to eliminate redundancies from the TED graph.

4.1.1.2 Basic Static Common Subexpression Elimination (SCSE) algorithm

Algorithm 2 describes the different steps needed to extract the existing redundancies in the TED graph. The following explains the portion of the algorithm that recognizes the candidates and chooses the best one for factorization.

Create a list of CSE candidates. The first step in performing static CSE is to recognize the existing common terms in the expressions represented by a TED. Such common terms are manifested in the TED graph as candidate nodes (nodes with multiple parent edges). The goal of CSE is to replace the subgraph rooted at a candidate node by a single variable corresponding to that expression. By doing this, we eliminate common terms appearing in several expressions, thus minimizing the overall complexity of the expression encoded by the TED. To illustrate this, let us consider the following set of polynomial expressions:

F1 = a + b(c + d)
F2 = e + f(c + d)
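Read as straight-line code, the effect of CSE on this pair of expressions is simply to name the shared term (c + d) once and reuse it; a minimal C++ sketch, with arbitrary input values:

```cpp
#include <iostream>

int main() {
    int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6;  // arbitrary inputs
    int S = c + d;       // common subexpression, computed once
    int F1 = a + b * S;  // F1 = a + b(c + d)
    int F2 = e + f * S;  // F2 = e + f(c + d)
    std::cout << F1 << " " << F2 << "\n";  // prints: 15 47
}
```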


Algorithm 2 Function: SCSE()
  i = 0
  Create a list L of CSE candidate nodes
  while L ≠ ∅ do
    i++
    Choose the best candidate node v ∈ L
    Create a new empty TED structure with a root node TSi
    Create a new internal node Si located just above node v
    Duplicate the subgraph rooted at node v into the new data structure
    Connect the root node TSi to the topmost node of the duplicated subgraph
    Create the list P of direct parent nodes of v
    for all p ∈ P do
      Redirect the outgoing edges of p to the internal node Si
    end for
    Delete recursively the nodes in the original TED that have no incoming edge
    Clear L
    Create the list L of CSE candidate nodes
  end while
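The core graph surgery in the loop, redirecting every parent edge of the chosen candidate to the new node Si, can be sketched as follows (a simplified, hypothetical node type; the duplication of the subgraph into a separate TED is omitted):

```cpp
#include <algorithm>
#include <vector>

struct TEDNode { std::vector<TEDNode*> children; };

// Redirect every parent edge of candidate v to the new internal node si,
// which stands for the extracted subexpression.
void redirectParents(std::vector<TEDNode*>& parents, TEDNode* v, TEDNode* si) {
    for (TEDNode* p : parents)
        std::replace(p->children.begin(), p->children.end(), v, si);
}

int main() {
    TEDNode v, b{{&v}}, f{{&v}};  // b and f both reference candidate v
    TEDNode si;                   // new node Si standing for v's term
    std::vector<TEDNode*> parents = {&b, &f};
    redirectParents(parents, &v, &si);
    return (b.children[0] == &si && f.children[0] == &si) ? 0 : 1;
}
```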

The associated TED of F1 and F2 with variable order a, b, e, f, c, d (i.e. with a placed on top) is represented in Figure 4.1.

Figure 4.1 – TED of the functions F1 and F2.


In this TED there are two edges that point to node vc, with var(vc) = c. This means that the functions rooted at node vb and node vf (corresponding to b ∗ (c + d) and f ∗ (c + d), respectively) share the common subexpression c + d rooted at node vc. The recognition of candidate nodes is performed by an analysis of the graph (described by Algorithm 3), which creates a list of candidate nodes for extraction. This list is then examined to choose the best candidate for common subexpression elimination.

Algorithm 3 Function: Create_List_of_Candidates()
  declare a map mParents keyed by a pointer to TEDNode (the child) and storing a vector of TEDNode (the parents)
  for all root nodes φ of the TED graph do
    for all internal nodes v that belong to the subgraph rooted at φ do
      for all children vchild of node v do
        push back v in the vector of mParents keyed by vchild
      end for
    end for
  end for
  for all vchild ∈ mParents do
    if mParents[vchild].size() > 1 and vchild is non-trivial then
      insert the child vchild in the candidate list L
    end if
  end for
  return L
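A compilable C++ sketch of Algorithm 3 is given below, on a deliberately simplified node type (the real TEDNode also carries a variable, edge weights and edge kinds); the visited set is an added detail so that each node is expanded once even in a shared graph:

```cpp
#include <map>
#include <set>
#include <vector>

struct TEDNode {
    std::vector<TEDNode*> children;
    bool trivial = false;  // points only to terminal 1 with weight +/-1
};

std::vector<TEDNode*> createListOfCandidates(const std::vector<TEDNode*>& roots) {
    std::map<TEDNode*, std::vector<TEDNode*>> mParents;  // child -> parent edges
    std::set<TEDNode*> visited;
    std::vector<TEDNode*> stack(roots.begin(), roots.end());
    while (!stack.empty()) {
        TEDNode* v = stack.back();
        stack.pop_back();
        if (!visited.insert(v).second) continue;  // expand each node once
        for (TEDNode* c : v->children) {
            mParents[c].push_back(v);  // one entry per parent edge
            stack.push_back(c);
        }
    }
    std::vector<TEDNode*> L;
    for (const auto& [child, parents] : mParents)
        if (parents.size() > 1 && !child->trivial)
            L.push_back(child);  // referenced more than once: CSE candidate
    return L;
}

int main() {
    TEDNode t1{{}, true};        // terminal node 1 (trivial)
    TEDNode cd{{&t1, &t1}};      // subgraph standing in for (c + d)
    TEDNode b{{&cd}}, f{{&cd}};  // two parents share it
    TEDNode root{{&b, &f}};
    return (int)createListOfCandidates({&root}).size();  // 1 candidate: (c + d)
}
```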

Choosing the best candidate for extraction. After generating the list of potential candidates for common subexpression elimination, the static CSE algorithm chooses the first subexpression to be extracted. The first heuristic implemented to make this choice is guided by the number of parent edges rooted at a candidate node. Specifically, the node with the maximum number of parent edges is selected for CSE. The rationale for this heuristic is that the larger the number of parent edges, the larger the number of occurrences of a given subexpression in the original expressions, and hence the higher the gain obtained by applying CSE (the gain is measured as the total number of operations encoded in the TED). Nevertheless, there may be more than one node with the maximum number of parent edges. This happens typically in DSP transforms, due to symmetry in the matrix of coefficients, which results in ambiguous situations where the choice of a node is not obvious. This is illustrated in Figure 4.2, where nodes v(X0)3, v(X0)4, v(X1)3, v(X1)4, v(X2)3 and v(X3)3 are detected as CSE candidates by the algorithm. The first heuristic reduces the list to the following nodes: v(X0)3, v(X1)3, v(X2)3 and v(X3)3, with four parent edges each. Nodes v(X0)4 and v(X1)4 are rejected because they are referenced only twice.


Figure 4.2 – TED of the DFT8 (with each node representing a complex number).

However, the number of references alone is not a sufficient criterion to discriminate these candidates, and another metric is needed to choose among the candidates with the same number of parent edges. Arbitrarily, the variable with the highest position in the graph (level, or index) is used as a secondary metric. In this case, the SCSE algorithm chooses node v(X0)3 as the best candidate. After choosing the best candidate for SCSE extraction, a new root node is created in a separate TED. The subgraph rooted at the candidate is extracted from the original TED, placed in the new TED and rooted to the newly created root node. For the example of the DFT8, the extraction of the first candidate results in the TED shown in Figure 4.3.
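The two criteria combine into a simple lexicographic choice: most parent edges first, then highest level. A minimal sketch (the Candidate fields and chooseBest function are hypothetical names):

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

struct Candidate {
    int parentEdges;  // number of edges referencing the node
    int level;        // variable position; larger means higher in the graph
};

Candidate chooseBest(const std::vector<Candidate>& L) {
    return *std::max_element(L.begin(), L.end(),
        [](const Candidate& a, const Candidate& b) {
            if (a.parentEdges != b.parentEdges)
                return a.parentEdges < b.parentEdges;  // primary: parent edges
            return a.level < b.level;                  // tie-break: level
        });
}

int main() {
    // Mimics the DFT8 situation: four candidates with 4 parent edges each,
    // two with only 2; the candidate at the highest level wins the tie.
    std::vector<Candidate> L = {{4, 7}, {4, 6}, {4, 5}, {4, 4}, {2, 7}, {2, 6}};
    Candidate best = chooseBest(L);
    std::cout << best.parentEdges << " " << best.level << "\n";  // prints: 4 7
}
```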


Figure 4.3 – TED of the DFT8 after extraction of subexpression X0 − X4.

After a complete execution of the algorithm, the following subexpressions have been found for the DFT8:

S1 = X0 − X4
S2 = X1 − X5
S3 = X2 − X6
S4 = X3 − X7
S5 = X0 − X2 + X4 − X6
S6 = X1 − X3 + X5 − X7

The final TED obtained after performing the SCSE algorithm is given in Figure 4.4.


Figure 4.4 – TED of the DFT8 obtained after performing the SCSE algorithm.

4.1.1.3 SCSE Results and comments

The DFT8 implemented as a matrix-vector product requires 64 multiplications and 56 additions if all the matrix coefficients are represented as symbolic variables. By eliminating the trivial coefficients ±1, this computation cost is reduced to 32 multiplications and 56 additions. The TED representation with a static (but well-chosen) order of variables reduces the number of operations to 14 multiplications and 38 additions. This reduction of the computation is an inherent feature of the TED representation. The SCSE algorithm allows one to extract the redundancies that exist in a TED. Even if this algorithm does not decrease the number of computations encoded in a TED, it guarantees that redundant terms are computed only once. Furthermore, the SCSE algorithm makes the number of operations (needed for the computation of the extracted terms) independent of the order of variables. Nevertheless, even if the previously defined heuristics seem to work for the DFT8, one should examine their robustness for other kinds of applications.

Consider, for example, the following set of functions:

F1 = a + e + f ∗ g
F2 = b + e + f ∗ g        (4.1)
F3 = c + e − f ∗ g
F4 = d + e − f ∗ g

The TED associated with functions F1, F2, F3 and F4 is represented in Figure 4.5. The cost of this TED in terms of the number of operations is 6 additions and 1 multiplication. The list of CSE candidates in Figure 4.5 is {v(e)1, v(e)2, v(f)}.

Figure 4.5 – TED of functions F1, F2, F3 and F4.

According to the previously defined criteria, there remains an ambiguity in choosing the best candidate node, since we now have to choose between the nodes v(e)1 and v(e)2, which have the same number of parent edges and are placed on the same level. Let us assume that v(e)1, followed by v(e)2, are chosen as the best CSE candidates. This results in the extraction of two common terms, S1 = e + f ∗ g and S2 = e − f ∗ g, and the resulting TED is shown in Figure 4.6. The number of operations associated with this TED is 6 additions and 2 multiplications, which is one multiplication more than the original TED. The reason for this increase in the number of operations is that the product f · g appears in both extracted subgraphs TED-S1 and TED-S2. This, in turn, happens because of the choice of the candidate nodes (in this case v(e)1 and v(e)2), placed above the product f · g rooted at v(f). By choosing, instead, the candidate node v(f), the number of operations can be reduced. This is discussed next.


Figure 4.6 – TED of functions F1, F2, F3 and F4, with S1 = e + f ∗ g and S2 = e − f ∗ g extracted as separate TEDs.

4.1.1.4 Enhanced Static Common Subexpression Elimination

Representing the extracted subexpressions in the original TED. To account for the fact that an extracted subgraph may itself contain candidate nodes indexed at a lower level, the TED data structure created by the CSE algorithm is combined with the modified TED obtained after extraction. This way, a common term of the newly created TEDs (f ∗ g) will be shared. This allows the algorithm to find common sub-expressions in the already extracted terms. The SCSE algorithm has been modified accordingly to support this feature, as shown in Algorithm 4. For the set of functions F1, F2, F3 and F4 defined by Equations 4.1, Algorithm 4 results in the TED shown in Figure 4.7.


Figure 4.7 – TED of functions F1, F2, F3, F4 with S1 = e + f ∗ g and S2 = e − f ∗ g in the same TED.

Figure 4.8 – TED of functions F1, F2, F3, F4 with S1 = e + S3, S2 = e − S3 and S3 = f ∗ g in the same TED.

Algorithm 4 Function: SCSE_top-down()
  i = 0
  Create the list L of CSE candidate nodes
  while L ≠ ∅ do
    i++
    Choose the best candidate node v ∈ L
    Create a new root node Si
    Create a new internal node Si located just above node v
    Duplicate the subgraph rooted at node v into the new data structure
    Connect the root node Si to the topmost node of the duplicated subgraph
    Create the list P of direct parent nodes of v
    for all p ∈ P do
      Redirect the outgoing edges of p to the internal node Si
    end for
    Delete recursively the nodes in the original TED that have no incoming edge
    Clear L
    Create the list L of CSE candidate nodes
  end while


The number of operations encoded in this TED is 1 multiplication and 6 additions, which is equal to that of the original TED in Figure 4.5. While the ambiguity still exists in choosing among the nodes on the same level, the choice of the best candidate does not affect the number of computations. We can also notice that the use of the second criterion (the variable level in the graph) allows the algorithm to find smaller common terms inside the bigger ones. The common term (f ∗ g) in Figure 4.7 is extracted using the CSE algorithm, as shown in Figure 4.8. This procedure can be viewed as a decomposition performed in a top-down fashion.

Choosing the CSE candidate at the lowest level

An alternative approach to performing such a decomposition is to choose the lowest candidate node as the best candidate for extraction. This ensures that its associated subgraph does not contain common terms. By analogy with the top-down decomposition, this can be seen as a bottom-up composition of common terms, presented in Algorithm 5.

Algorithm 5 Function: SCSE_bottom-up()
  i = 0
  Create the list L of CSE candidate nodes
  while L ≠ ∅ do
    i++
    Choose the lowest candidate node v ∈ L
    Create a new root node Si
    Create a new internal node Si located just above node v
    Create the list P of direct parent nodes of v
    for all p ∈ P do
      Redirect the outgoing edges of p to the internal node Si
    end for
    Connect the root node Si to the candidate node
    Clear L
    Create the list L of CSE candidate nodes
  end while

For the set of functions F1, F2, F3 and F4 defined by Equations 4.1, this bottom-up approach first chooses v(f) as the best candidate at the first iteration. The TED after the complete extraction of common terms is shown in Figure 4.9. The cost associated with this TED representing F1, F2, F3 and F4 is also 1 multiplication and 6 additions.


Figure 4.9 – TED of the functions F1, F2, F3 and F4 with extraction of the lowest candidates as the heuristic.

4.1.1.5 Conclusions

The initial heuristics developed in this section lead to the following conclusion: the extraction of common terms must be performed according to the level of the corresponding candidate nodes in the TED graph. The actual implementation of the SCSE algorithm (either in a top-down or bottom-up fashion) allows one to extract the common terms that appear explicitly in the TED graph. Notice that the TED cost, in number of operations, of an already extracted part of the graph is independent of the order of variables. Let us now evaluate the efficiency of the final Static CSE on a Walsh-Hadamard Transform (WHT), which is widely used in communication designs. A straightforward implementation of this transform results in N² multiplications and N(N − 1) additions.


An attentive designer will notice that the coefficients of the WHT transform are all equal to ±1. Thus a reasonable implementation of the WHT could be realized with only N(N − 1) additions. At the same time, the TED obtained from the matrix representation of the WHT transform contains Σi=1..log2(N) (N/2^i) · (N/2^(i−1)) additions. Table 4.1 shows the number of additions for a direct implementation, the initial TED, the TED after the SCSE algorithm, and a radix-2 decomposition. The radix-2 decomposition is a well-known technique used by designers and by automation tools such as FFTW or SPIRAL.

                        WHT4   WHT8   WHT16   WHT32   WHT64   WHT128   WHT256
Direct Implementation     12     56     240     992    4032    16256    65280
Original TED              10     42     170     682    2730    10922    43690
SCSE                      10     42     170     682    2730    10922    43690
Radix 2                    8     24      64     160     384      896     2048

Table 4.1 – Comparison of the number of additions obtained with a direct implementation, the original TED, the TED subjected to SCSE, and a radix-2 decomposition.

As shown in the table, the number of operations grows quickly with the size of the WHT transform. Even if the use of the TED representation reduces the number of operations compared to the original form, the radix-2 decomposition drastically outperforms these results. Moreover, as shown by this example, the SCSE algorithm does not change the complexity in terms of the number of operations, as it can find only explicit redundancies in the TED graph. The following subsection proposes a new algorithm that performs common subexpression elimination and dynamically reorders the TED graph to expose new candidates.
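The columns of Table 4.1 follow closed forms that can be checked quickly: N(N − 1) additions for the direct implementation, the sum given above for the TED, and N·log2(N) for the radix-2 decomposition (this last formula is inferred from the table values, not stated in the text). A small verification sketch:

```cpp
#include <iostream>

int main() {
    for (int N = 4; N <= 256; N *= 2) {
        long direct = (long)N * (N - 1);       // N(N-1) additions
        int log2N = 0;
        for (int t = N; t > 1; t /= 2) ++log2N;
        long ted = 0;                          // sum of (N/2^i)*(N/2^(i-1))
        for (int i = 1; i <= log2N; ++i)
            ted += (long)(N >> i) * (N >> (i - 1));
        long radix2 = (long)N * log2N;         // N*log2(N) additions
        std::cout << "WHT" << N << ": " << direct << " / " << ted
                  << " / " << radix2 << "\n";  // matches the table rows
    }
}
```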


4.1.2 Dynamic Common Subexpression Elimination (DCSE)

4.1.2.1 Solving the limitations of SCSE

As previously explained, the SCSE algorithm is unable to find all the redundant computations in a mathematical expression: only redundant terms that are represented explicitly as non-trivial subgraphs can be found. To illustrate this point, consider a WHT of size 4 (WHT4) represented by the following set of polynomial expressions:

Y0 = X0 + X1 + X2 + X3
Y1 = X0 − X1 + X2 − X3
Y2 = X0 + X1 − X2 − X3
Y3 = X0 − X1 − X2 + X3

The matrix corresponding to the WHT4 is given by Equation 4.2:

         | 1   1   1   1 |
WHT4 =   | 1  −1   1  −1 |        (4.2)
         | 1   1  −1  −1 |
         | 1  −1  −1   1 |

As already shown in Table 4.1, a direct implementation of the WHT4 results in 12 additions (3 additions for each of the four outputs). By traversing the TED graph, the number of operations needed to compute the WHT4 can be reduced to 10 additions. Nevertheless, it is possible to take advantage of the structured form of this DSP transform. The number of operations needed to compute a WHT4 can be reduced further by recognizing that the terms X0 + X1, X0 − X1, X2 + X3 and X2 − X3 occur many times during the computation of the WHT4, and by applying the CSE algorithm. The result is the following set of expressions:

S1 = X2 − X3
S2 = X2 + X3
S3 = X0 − X1
S4 = X0 + X1
Y0 = S2 + S4
Y1 = S1 + S3
Y2 = S4 − S2
Y3 = S3 − S1

This set can be computed with only 8 additions, versus 12 for the original WHT4 specification. To achieve such a reduction, we need a way to automatically recognize the common terms S1, S2, S3, S4 in a WHT4 represented as a TED. Let us now review the terms found by the SCSE procedure to understand how this optimization algorithm could be improved.
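Before doing so, note that the 8-addition factored form above can be checked directly; a minimal sketch with arbitrary inputs (the wht4 function name is ours):

```cpp
#include <array>
#include <iostream>

// Factored WHT4: 8 additions/subtractions instead of 12,
// using the shared terms S1..S4 from the text.
std::array<int, 4> wht4(const std::array<int, 4>& X) {
    int S1 = X[2] - X[3], S2 = X[2] + X[3];        // 2 operations
    int S3 = X[0] - X[1], S4 = X[0] + X[1];        // 2 operations
    return {S2 + S4, S1 + S3, S4 - S2, S3 - S1};   // 4 operations
}

int main() {
    auto Y = wht4({1, 2, 3, 4});
    for (int y : Y) std::cout << y << " ";  // prints: 10 -2 -4 0
    std::cout << "\n";
}
```

The output agrees with the direct evaluation of Y0..Y3 from Equation 4.2.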


Consider the TED of the WHT4 in Figure 4.10. In this TED, the nodes v(X2)1 and v(X2)2 are CSE candidates. The SCSE algorithm applied to this TED results in the new TED depicted in Figure 4.11, where only two subexpressions have been found: S1 = X2 − X3 and S2 = X2 + X3.

Figure 4.10 – Initial TED representation of the WHT4.

Figure 4.11 – TED of the WHT4 after applying SCSE.

The TED shown in Figure 4.11 does not expose the redundant terms that contain variables X0 and X1, namely X0 + X1 and X0 − X1. This is due to the order of variables in the TED, where X0 and X1 are placed on top. By placing variables S1 and S2 above variables X0 and X1, it is possible to expose new CSE candidates (nodes v(X0)1 and v(X0)2), as illustrated in Figure 4.12.

Figure 4.12 – TED of the WHT4 after SCSE, with intermediate variables S1 and S2 pushed to the top.


The SCSE algorithm can be applied again to extract the new redundant subexpressions, rooted at nodes v(X0)1 and v(X0)2 . The TED obtained after such an extraction is shown in Figure 4.13.

Figure 4.13 – TED of the WHT4 after the extraction of all common terms.

This example shows that finding all common subexpressions in a mathematical expression can be achieved with a TED-based representation by dynamically modifying the order of variables during common subexpression elimination. To this end, several modifications must be made to the SCSE algorithm to provide new opportunities for finding CSE candidates in the TED graph, resulting in Algorithm 6. We refer to this new algorithm as Dynamic Common Subexpression Elimination (DCSE).

4.1.2.2 Preliminary version of DCSE

By applying the DCSE algorithm to the TED, all the redundant terms that exist in the mathematical expressions of a Walsh-Hadamard Transform are found. In fact, this algorithm performs a full radix-2 decomposition of the WHT transforms, as shown by Table 4.2. Notice that the WHT matrix is composed only of elements equal to ±1 and hence does not involve any multiplication.


Algorithm 6 Function: DCSE()
  i = 0
  toplevel = level of the highest node in the TED
  if some nodes are coefficients then
    constant_limit = level of the lowest node corresponding to a coefficient
  else
    constant_limit = toplevel
  end if
  Create the list L of CSE candidate nodes
  while L ≠ ∅ do
    i++
    Choose the lowest candidate v ∈ L
    Create a new root node Si
    Create a new internal node Si located just above node v
    Create the list P of direct parent nodes of v
    for all p ∈ P do
      Redirect the outgoing edges of p to the internal node Si
    end for
    Connect the root node Si to the candidate node
    if constant_limit