AUTOMATIC HARDWARE IMPLEMENTATION TOOL FOR A DISCRETE ADABOOST BASED DECISION ALGORITHM

J. Mitéran, J. Matas, E. Bourennane, M. Paindavoine, J. Dubois

Le2i - UMR CNRS 5158, Aile des Sciences de l'ingénieur, Université de Bourgogne - BP 47870 - 21078 Dijon - FRANCE, [email protected]
Center for Machine Perception - CVUT, Karlovo Namesti 13, Prague, Czech Republic

Abstract. We propose a method and a tool for automatic generation of hardware implementation of a decision rule based on the Adaboost algorithm. We review the principles of the classification method and we evaluate its hardware implementation cost in terms of FPGA’s slice, using different weak classifiers based on the general concept of hyperrectangle. The main novelty of our approach is that the tool allows the user to find automatically an appropriate trade-off between classification performances and hardware implementation cost, and that the generated architecture is optimised for each training process. We present results obtained using Gaussian distributions and examples from UCI databases. Finally, we present an example of industrial application of real-time textured image segmentation.

Keywords : Adaboost, FPGA, classification, hardware, image segmentation

1 INTRODUCTION

In this paper, we propose a method for the automatic generation of a hardware implementation of a particular decision rule. The paper focuses mainly on high-speed decisions (approximately 15 to 20 ns per decision), which are useful for high-resolution image segmentation (low-level decision functions) or pattern recognition tasks in very large image databases. Our work (shown in grey in Fig. 1) is designed to be easily integrated in a System-On-Chip which can perform the full process: acquisition, feature extraction and classification, in addition to other custom data processing.

(Fig. 1 block diagram: input data (pixels) → virtual component (IP) for low-level feature extraction → IP for low-level decision function → IP for high-level analysis → output data (pixels, classes, etc.), together with other custom processing, on a single chip (FPGA).)

Fig. 1 Principle of a decision function integrated in a System-On-Chip

Many implementations of particular classifiers have been proposed, mainly based on neural networks [1, 2, 3] or, more recently, on the Support Vector Machine (SVM) [4]. However, the implementation of a general classifier is often not optimal in terms of silicon area, because of the general structure of the selected algorithm, and a manual VHDL description is often a long and difficult task. In recent years, high-level synthesis tools, which translate a high-level behavioural language description into a register transfer level (RTL) representation [5], have been developed and allow such a manual description to be avoided. Compilers are available, for example, for SystemC, Streams-C, Handel-C [6, 7] or for the translation of DSP binaries [8]. Our approach is slightly different: in the case of supervised learning, it is possible to compile the learning data in order to obtain an optimized architecture, without the need for a high-level language translation.

The aim of this work is to provide an EDA tool (Boost2VHDL, developed in C++) which automatically generates the hardware description of a given decision function, while finding an efficient trade-off between decision speed, classification performance and silicon area, which we will call the hardware implementation cost and denote λ. The development flow is depicted in Fig. 2. The idea is to generate the architecture automatically from the learning data and the results of the learning algorithm. The first process is the learning step of a supervised classification method, which produces, off-line, a set of rules and constant values (built from a set of samples and their associated classes). The second step is also an off-line process; during this step, called Boost2VHDL, we automatically build from the previously processed rules the VHDL files implementing the decision function. In a third step, we use a standard implementation tool, producing the bit-stream file which can be downloaded into the hardware target. A new learning step gives us a new architecture. During the on-line process, the classification features and the decision function are continuously computed from the input data, producing the output class (see Fig. 1). This approach allows us to generate an architecture optimised for a given learning result, but implies the use of a programmable hardware target in order to keep flexibility. Moreover, the time constraints for the whole process (around 20 ns per acquisition/feature extraction/decision) imply a high degree of parallelism: all the classification features have to be computed simultaneously, and the intrinsic operations of the decision function itself have to be computed in parallel. This naturally led us to use an FPGA as the hardware target.
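The core idea of the off-line generation step is that the learning result is compiled directly into the hardware description. The following C++ fragment is only a minimal sketch of that idea, not the actual Boost2VHDL code: the learned constants are written as VHDL constants in a generated package, so that synthesis can specialise the architecture for one training run. The structure names, the fixed-point quantisation of the weights and the file layout are illustrative assumptions.

#include <cstdio>
#include <vector>

// One learned weak classifier (interval on one feature) plus its weight,
// already quantised to fixed point. All names here are hypothetical.
struct WeakParams {
    int feature;   // feature index d
    int low, high; // interval bounds (0..255 for byte features)
    int alpha_q;   // alpha_t as a fixed-point integer
};

// Emit a VHDL package holding the learned constants, so that the decision
// architecture can be synthesised for this particular training result.
void emit_vhdl_package(const std::vector<WeakParams>& h, std::FILE* out) {
    std::fprintf(out, "package boost_constants is\n");
    std::fprintf(out, "  constant T : natural := %u;\n",
                 static_cast<unsigned>(h.size()));
    for (std::size_t t = 0; t < h.size(); ++t) {
        std::fprintf(out, "  constant FEAT_%u  : natural := %d;\n",
                     static_cast<unsigned>(t), h[t].feature);
        std::fprintf(out, "  constant LOW_%u   : natural := %d;\n",
                     static_cast<unsigned>(t), h[t].low);
        std::fprintf(out, "  constant HIGH_%u  : natural := %d;\n",
                     static_cast<unsigned>(t), h[t].high);
        std::fprintf(out, "  constant ALPHA_%u : integer := %d;\n",
                     static_cast<unsigned>(t), h[t].alpha_q);
    }
    std::fprintf(out, "end package boost_constants;\n");
}

int main() {
    // Two toy weak classifiers standing in for a real learning result.
    std::vector<WeakParams> learned = {{0, 10, 200, 37}, {3, 0, 128, 21}};
    emit_vhdl_package(learned, stdout);
    return 0;
}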

(Fig. 2 block diagram: off-line, the input data (pixels, images) feed the learning step, which produces the classification rules and constant values; the automatic generation tool Boost2VHDL turns them into the decision-function VHDL files, which standard synthesis tools (Xilinx) convert into the FPGA configuration bit-stream. On-line, the input data (pixels, images) are classified on the chip (FPGA), producing the output data (pixels, classes, etc.).)

Fig. 2 Development flow

In recent years FPGAs have become increasingly important and have found their way into system design. FPGAs are used during development, prototyping and initial production, and can be replaced by hardwired gate arrays or application-specific integrated circuits (ASICs) for high-volume production. This trend is reinforced by rapid technological progress, which enables the commercial production of ever more complex devices [9]. The advantage of these components compared to ASICs is mainly their on-board reconfigurability, and compared to a standard processor, their high level of potential parallelism [10]. Using a reconfigurable architecture, it is possible to integrate constant values directly into the design of the decision function (here, for example, the constants resulting from the learning step), optimising the number of cells used. We consider the slice (Fig. 3) as the main elementary structure of the FPGA and as the unit of λ. One component can contain a few thousand of these blocks. While the size of these components keeps increasing, it is still necessary to minimize the number of slices used by each function in the chip: this reduces the global cost of the system, increases the classification performance and the number of operators that can be implemented, or allows the implementation of other processes on the same chip.


Fig. 3 Slice structure

We chose the well-known Adaboost algorithm as the implemented classifier. The decision step of this classifier consists in a simple summation of signed numbers [11, 12, 13]. Introduced by Schapire in 1990, Boosting is a general method of producing a very accurate prediction rule by combining rough and moderately inaccurate "rules of thumb". Most recent work has been on the "AdaBoost" boosting algorithm and its extensions. Adaboost is currently used in numerous research projects and applications, such as the Viola-Jones face detector [14], the image retrieval problem [15], the Word Sense Disambiguation problem [16], or prediction in the wireless telecommunications industry [17]. It can also be used to improve the classification performance of other classifiers such as SVM [18]. The reader will find a very large bibliography on http://www.boosting.org. Boosting, because of its interesting property of maximizing margins between classes, is one of the most widely used and studied supervised methods in the machine learning community, together with Support Vector Machines and neural networks. It is a powerful machine learning method that can be applied directly, without any modification, to generate a classifier implementable in hardware, and a complexity/performance trade-off is natural in the framework: Adaboost learning gradually constructs a set of classifiers with increasing complexity and better performance (lower cross-validated error).

Throughout this study we kept in mind the necessity of obtaining high classification performance. We systematically measured the classification error e (using a ten-fold cross-validation protocol). Indeed, in order to meet real-time processing and cost constraints, we had to minimise the error e while minimising the hardware implementation cost λ and maximising the decision speed. The maximum speed has been obtained using a fully parallel implementation.

In the first part of this paper, we present the principle of the proposed method, reviewing the Adaboost algorithm. We describe how it is possible, given the result of a learning step, to estimate the fully parallel hardware implementation cost in terms of slices. In the second part, we define a family of weak classifiers suitable for hardware implementation, based on the general concept of the hyperrectangle. We present the algorithm which finds a hyperrectangle minimising the classification error and allows us to find a good trade-off between classification performance and the estimated hardware implementation cost. This method is based on previous work: we have shown in [19, 20] that it is possible to implement a hyperrectangle-based classifier in a parallel component in order to obtain the required speed. We then define the global hardware implementation cost, taking into account the structure of the Adaboost method and the structure of the weak classifiers. In the third part, results are presented: we applied the method to Gaussian distributions, which are often used in the literature for performance evaluation of classifiers [21], and we present results obtained on real databases coming from the UCI repository. Finally, we applied the method to an industrial problem, the real-time visual inspection of CRT cathodes. The aim is to perform a real-time image segmentation based on pixel classification; this segmentation is an important preprocessing step used for the detection of anomalies on the cathode.
The main contributions of this paper are the from-learning-data-to-architecture tool and, within the Adaboost process, the introduction of hyperrectangle-based weak classifiers as a way to jointly optimise classification performance and hardware cost.


2 PROPOSED METHOD

2.1 Review of Adaboost

The basic idea introduced by Schapire and Freund [11, 12, 13] is that a combination of single rules or "weak classifiers" gives a "strong classifier". Each sample is defined by a feature vector x = (x_1, x_2, ..., x_D)^T in a D-dimensional space and its corresponding class C(x) = y ∈ {−1, +1} in the binary case. We define the weighted learning set S of p samples as S = {(x_1, y_1, w_1), (x_2, y_2, w_2), ..., (x_p, y_p, w_p)}, where w_i is the weight of the i-th sample. Each iteration of the process consists in finding the best possible weak classifier, i.e. the classifier for which the weighted error is minimum; if the weak classifier is a single threshold, for example, all candidate thresholds are tested and the one with the lowest weighted error is kept. After each iteration, the weights of the misclassified samples are increased and the weights of the well-classified samples are decreased. The final class y is given by:

y(x) = sgn( ∑_{t=1}^{T} α_t h_t(x) )    (1)

where both α_t and h_t are to be learned by the following boosting procedure.

1. Input: S = {(x_1, y_1, w_1), (x_2, y_2, w_2), ..., (x_p, y_p, w_p)} and the number of iterations T.
2. Initialise w_i^(0) = 1/p for all i = 1, ..., p.
3. Do for t = 1, ..., T:
   3.1 Train a classifier with respect to the weighted sample set and obtain the hypothesis h_t: x → {−1, +1}.
   3.2 Calculate the weighted error ε_t of h_t:
       ε_t = ∑_{i=1}^{p} w_i^(t) I(y_i ≠ h_t(x_i))
   3.3 Compute the coefficient α_t:
       α_t = (1/2) log((1 − ε_t) / ε_t)
   3.4 Update the weights:
       w_i^(t+1) = w_i^(t) exp(−α_t y_i h_t(x_i)) / Z_t
       where Z_t is a normalisation constant: Z_t = 2 sqrt(ε_t (1 − ε_t)).
4. Stop if ε_t = 0 or ε_t ≥ 1/2 and set T = t − 1.
5. Output: y(x) = sgn( ∑_{t=1}^{T} α_t h_t(x) ).

The characteristics of the classifier we have to encode in the architecture are the coefficients α_t for t = 1, ..., T and the intrinsic constants of each weak classifier h_t.
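To make the procedure concrete, the following C++ fragment is a minimal sketch of the Discrete Adaboost loop (steps 2 to 5), with the weak learner left as an abstract callback. The names and interfaces are illustrative assumptions, not the authors' implementation.

#include <cmath>
#include <functional>
#include <vector>

// A trained weak hypothesis h_t and its coefficient alpha_t.
struct Weak {
    std::function<int(const std::vector<double>&)> h; // returns -1 or +1
    double alpha;
};

// trainWeak receives the current weights w and must return the hypothesis
// minimising the weighted error (step 3.1); x and y can be captured by it.
std::vector<Weak> adaboost(
    const std::vector<std::vector<double>>& x, const std::vector<int>& y, int T,
    std::function<std::function<int(const std::vector<double>&)>(
        const std::vector<double>&)> trainWeak)
{
    const std::size_t p = x.size();
    std::vector<double> w(p, 1.0 / p);               // step 2: w_i^(0) = 1/p
    std::vector<Weak> strong;
    for (int t = 0; t < T; ++t) {                    // step 3
        auto h = trainWeak(w);                       // 3.1
        double eps = 0.0;                            // 3.2: weighted error
        for (std::size_t i = 0; i < p; ++i)
            if (h(x[i]) != y[i]) eps += w[i];
        if (eps <= 0.0 || eps >= 0.5) break;         // step 4: discard h_t, T = t-1
        double alpha = 0.5 * std::log((1.0 - eps) / eps);  // 3.3
        double Z = 2.0 * std::sqrt(eps * (1.0 - eps));     // normalisation Z_t
        for (std::size_t i = 0; i < p; ++i)          // 3.4: weight update
            w[i] = w[i] * std::exp(-alpha * y[i] * h(x[i])) / Z;
        strong.push_back({h, alpha});
    }
    return strong;  // step 5: the decision is the sign of the weighted sum
}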

2.2 Parallel implementation of the global structure

The final decision function to be implemented (eq. 1) is a particular sum of products, where each product is made of a constant (α_t) and the value −1 or +1 depending on the output of h_t. It is therefore possible to avoid computing multiplications, which is an important gain in terms of hardware cost compared to other classifiers such as SVM or standard neural networks. The parallel structure of a possible hardware implementation is depicted in Fig. 4.
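As a software sketch of this multiplication-free evaluation, the fragment below mirrors the structure of Fig. 4: each weak output acts as a multiplexer select between +α_t and −α_t (stored here as fixed-point integers, an assumption of the sketch), and only the sign of the accumulated sum is output.

#include <vector>

// weak_outputs[t] is h_t(x) in {-1,+1}; alpha_q[t] is alpha_t in fixed point.
int decide(const std::vector<int>& weak_outputs, const std::vector<int>& alpha_q)
{
    long long acc = 0;
    for (std::size_t t = 0; t < weak_outputs.size(); ++t)
        acc += (weak_outputs[t] > 0) ? alpha_q[t] : -alpha_q[t];  // the mux
    return (acc >= 0) ? +1 : -1;                                  // sgn
}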



Fig. 4 Parallel implementation of Adaboost

In terms of slices, the hardware cost can be expressed as follows:

λ = (T − 1) λ_add + λ_T

where λ_add is the cost of an adder (considered as a constant here) and λ_T is the cost of the parallel implementation of the set of weak classifiers:

λ_T = ∑_{t=1}^{T} λ_t

where λ_t is the cost of the weak classifier h_t together with its multiplexer. One can note that, due to the binary nature of the output of h_t, it is possible to encode the results of the additions and subtractions in the 16-bit LUTs of the FPGA, using the outputs of the weak classifiers as addresses (Fig. 5). This is the first way to obtain an architecture optimised for a given learning result; the second is the implementation of the weak classifiers themselves.
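A software estimate of λ following this formula might look as follows; the default value of λ_add and the per-classifier costs λ_t are placeholders to be calibrated against an actual synthesis report, not figures taken from the paper.

#include <vector>

// lambda = (T - 1) * lambda_add + sum_t lambda_t
int estimate_slices(const std::vector<int>& lambda_t /* per weak classifier */,
                    int lambda_add = 4 /* assumed slices per adder */)
{
    int T = static_cast<int>(lambda_t.size());
    int cost = (T > 0) ? (T - 1) * lambda_add : 0;   // adder tree
    for (int c : lambda_t) cost += c;                 // weak classifiers + muxes
    return cost;
}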


Fig. 5 Details of the first stage, coding constants in the architecture of the FPGA

Since the classifier h_t is used T times, it is critical to optimise its implementation in order to minimise the hardware cost. A single axis-parallel threshold is often used as the weak classifier in the Boosting literature. However, this type of classifier requires a large number of iterations T, and hence the hardware cost increases (as it depends on the number of additions to be performed in parallel). Increasing the complexity of the weak classifier allows faster convergence and thus reduces the number of additions, but it also increases the second term of the equation, λ_T. We therefore have to find a trade-off between the complexity of h_t and the hardware cost.


3 WEAK CLASSIFIER DEFINITION AND IMPLEMENTATION OF THE WHOLE DECISION FUNCTION

3.1 Choice of the weak classifier - definitions

It has been shown in the literature that decision trees based on hyperrectangles (or unions of boxes) instead of single thresholds give better results [22]. Moreover, the decision function associated with a hyperrectangle can easily be implemented in parallel (Fig. 8). However, there is no algorithm of reasonable complexity in D which finds the best hyperrectangle, i.e. the one minimising the learning error, so we will use a suboptimal algorithm to find it. We define the generalised hyperrectangle as a set H of 2D thresholds and a class y_H, with y_H ∈ {−1, +1}:

H = {θ_1^l, θ_1^u, θ_2^l, θ_2^u, ..., θ_D^l, θ_D^u, y_H}

where θ_k^l and θ_k^u are respectively the lower and upper limits of a given interval in the k-th dimension. The decision function is

h_H(x) = y_H ⇔ ∏_{d=1}^{D} ((x_d > θ_d^l) and (x_d < θ_d^u)),   h_H(x) = −y_H otherwise

This expression, where the product denotes the logical AND, can be simplified if some of these limits are pushed to infinity (or to 0 and 255 in the case of a byte-based implementation): the corresponding comparisons are not necessary, since their result is always true. This is particularly important for minimising the final number of used slices. Two particular cases of hyperrectangles have to be considered.

The single threshold: Γ = {θ_d, y_Γ}, where θ_d is a single threshold on feature d ∈ {1, ..., D}, and the decision function is

h_Γ(x) = y_Γ ⇔ x_d < θ_d,   h_Γ(x) = −y_Γ otherwise

The single interval: Δ = {θ_d^l, θ_d^u, y_Δ}, where the decision function is

h_Δ(x) = y_Δ ⇔ (x_d > θ_d^l) and (x_d < θ_d^u),   h_Δ(x) = −y_Δ otherwise
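For illustration, the three decision rules can be sketched in C++ on byte features as follows. Bounds kept at 0 or 255 stand for limits pushed to infinity and generate no comparator, hence no test in the sketch; the struct layout is an assumption made for this example only.

#include <cstdint>
#include <vector>

struct Hyperrect {               // H = {theta_d^l, theta_d^u, ..., y_H}
    std::vector<uint8_t> low;    // theta_d^l (0 means "no lower bound")
    std::vector<uint8_t> high;   // theta_d^u (255 means "no upper bound")
    int y;                       // y_H in {-1,+1}
};

int h_H(const Hyperrect& H, const std::vector<uint8_t>& x) {
    for (std::size_t d = 0; d < x.size(); ++d) {
        // a bound at 0 or 255 is pushed to infinity: no comparator is needed
        if (H.low[d] > 0    && !(x[d] > H.low[d]))  return -H.y;
        if (H.high[d] < 255 && !(x[d] < H.high[d])) return -H.y;
    }
    return H.y;                  // inside the box on every axis
}

// Single threshold Gamma = {theta_d, y_Gamma}: one comparison on one feature.
int h_threshold(int d, uint8_t theta, int y, const std::vector<uint8_t>& x) {
    return (x[d] < theta) ? y : -y;
}

// Single interval Delta = {theta_d^l, theta_d^u, y_Delta}: one axis, two bounds.
int h_interval(int d, uint8_t lo, uint8_t hi, int y,
               const std::vector<uint8_t>& x) {
    return (x[d] > lo && x[d] < hi) ? y : -y;
}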

In these two particular cases, it is easy to find the optimum hyperrectangle, because each feature is considered independently of the others: the optimum is obtained by computing the weighted error for each possible hyperrectangle and choosing the one for which the error is minimum. In the general case, one has to follow a particular heuristic giving a suboptimal hyperrectangle. A family of such classifiers has been defined, based on the NGE algorithm described by Salzberg [23], whose performance was compared to the Knn method by Wettschereck and Dietterich [24]. This method divides the attribute space into a set of hyperrectangles built from samples. The performance of our own implementation was studied in [25]. We review the principle of the hyperrectangle determination in the next paragraph.

3.2 Review of the hyperrectangle-based method

The core of the strategy is the determination of the hyperrectangle set S_H from a set of samples S. The basic idea is to build around each sample {x_i, y_i} ∈ S a box or hyperrectangle H(x_i) containing no sample of the opposite class (see Fig. 6 and Fig. 7):

H(x_i) = {θ_i1^l, θ_i1^u, θ_i2^l, θ_i2^u, ..., θ_iD^l, θ_iD^u, y_i}

The initial value is set to 0 for all lower bounds and 255 for all upper bounds. In order to measure the distance between two samples in the feature space, we use the "max" distance defined by

d_∞(x_i, x_j) = max_{k=1,...,D} |x_ik − x_jk|


The use of this distance instead of the Euclidean distance makes it easy to build hyperrectangles instead of hyperspheres. For each axis of the feature space, we determine the sample {x_z, y_z}, y_z ≠ y_i, as the nearest neighbour of x_i belonging to a different class:

z = arg min_{j : y_j ≠ y_i} d_∞(x_i, x_j)

The threshold defining one bound of the box is perpendicular to the axis k for which the distance is maximum:

k = arg max_k |x_ik − x_zk|

If x_ik > x_zk we compute the lower limit θ_ik^l = x_ik − R·(x_ik − x_zk); otherwise we compute the upper limit θ_ik^u = x_ik + R·(x_zk − x_ik). The parameter R should be less than or equal to 0.5; this constraint ensures that the hyperrectangle cannot contain any sample of the opposite class. The procedure is repeated until all the bounds of H(x_i) have been found.
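A simplified C++ sketch of this box construction is given below, under the assumption that each bound is placed between x_i and the chosen opposite-class sample at a fraction R ≤ 0.5 of the gap, and with the iterative "repeat until all bounds are found" loop replaced by a single conservative pass over all opposite-class samples. It is an illustration of the principle, not the authors' exact procedure.

#include <algorithm>
#include <cstdlib>
#include <vector>

struct Box { std::vector<int> low, high; int y; };

Box build_box(std::size_t i, const std::vector<std::vector<int>>& x,
              const std::vector<int>& y, double R = 0.5)
{
    const std::size_t D = x[i].size();
    Box H{std::vector<int>(D, 0), std::vector<int>(D, 255), y[i]};  // initial bounds
    // Simplified: one pass over all opposite-class samples, tightening one
    // bound per sample along its farthest axis (the d_inf distance).
    for (std::size_t j = 0; j < x.size(); ++j) {
        if (y[j] == y[i]) continue;                  // opposite class only
        std::size_t k = 0; int dmax = -1;            // axis of maximum |x_id - x_jd|
        for (std::size_t d = 0; d < D; ++d) {
            int diff = std::abs(x[i][d] - x[j][d]);
            if (diff > dmax) { dmax = diff; k = d; }
        }
        if (x[i][k] > x[j][k])                       // bound lies below x_i on axis k
            H.low[k]  = std::max(H.low[k],  static_cast<int>(x[i][k] - R * dmax));
        else                                         // bound lies above x_i on axis k
            H.high[k] = std::min(H.high[k], static_cast<int>(x[i][k] + R * dmax));
    }
    return H;
}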

Fig. 6 Determination of the first limit of H(x_4) (in this example i = 4, z = 7, k = 1, and the upper bound θ_41^u is set along axis 1)

Fig. 7 Hyperrectangle computation: determination of H(x_4) and hyperrectangles obtained after the merging step

During the second step, hyperrectangles of a given class are merged in order to eliminate redundancy (hyperrectangles which lie entirely inside another hyperrectangle of the same class). We obtain a set S_H of hyperrectangles:

S_H = {H_1, H_2, ..., H_q}
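A sketch of this redundancy elimination is shown below, under the assumption that "inside" means that every bound of one box lies within the corresponding bounds of another box of the same class; the tie-break keeps one copy of identical boxes.

#include <vector>

struct Box { std::vector<int> low, high; int y; };

bool inside(const Box& a, const Box& b) {            // is a contained in b?
    if (a.y != b.y) return false;
    for (std::size_t d = 0; d < a.low.size(); ++d)
        if (a.low[d] < b.low[d] || a.high[d] > b.high[d]) return false;
    return true;
}

std::vector<Box> remove_redundant(const std::vector<Box>& S) {
    std::vector<Box> out;
    for (std::size_t i = 0; i < S.size(); ++i) {
        bool redundant = false;
        for (std::size_t j = 0; j < S.size(); ++j)
            // drop S[i] if it is inside S[j]; keep one copy of identical boxes
            if (i != j && inside(S[i], S[j]) && (!inside(S[j], S[i]) || j < i)) {
                redundant = true; break;
            }
        if (!redundant) out.push_back(S[i]);
    }
    return out;
}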

We evaluated the performance of this algorithm in various cases, using theoretical distributions as well as real samples [19]. We compared its performance with neural networks, the Knn method and a Parzen kernel-based method [26]. It clearly appears that the algorithm performs poorly when the inter-class distances are too small: a large number of hyperrectangles are created in the overlap area, slowing down the decision or increasing the implementation cost. However, it is possible to use the hyperrectangles generated in this way as a step of the Adaboost process, selecting the best one in terms of classification error.

3.3 Boosting the general hyperrectangle and combination of weak classifiers

From S_H we have to build one hyperrectangle H_opt minimising the weighted error. To obtain this result, we merge hyperrectangles following a one-to-one strategy, thus building q' = q(q − 1) new hyperrectangles, and we keep the hyperrectangle which gives the smallest weighted error. For each iteration of Adaboost step 3.1, the algorithm is:

3.1.1 Initialise ε_min = 1.0
3.1.2 Do for each class y = −1, +1
        Do for i = 0, ..., q(y)
          Do for j = i + 1, ..., q(y)
            Build H_temp = H_i ∪ H_j
            Compute ε_H, the weighted error based on H_temp
            If ε_H < ε_min then H_opt = H_temp and ε_min = ε_H
          end j
        end i
      end y
3.1.3 Output: h_H = H_opt
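The selection of H_opt can be sketched in C++ as follows. The union H_i ∪ H_j is read here as the bounding box of the two hyperrectangles, and the per-class handling is reduced to a same-class test; both points are assumptions of this sketch.

#include <algorithm>
#include <vector>

struct Box { std::vector<int> low, high; int y; };

int h_box(const Box& H, const std::vector<int>& x) {  // box decision rule
    for (std::size_t d = 0; d < x.size(); ++d)
        if (!(x[d] > H.low[d] && x[d] < H.high[d])) return -H.y;
    return H.y;
}

Box merge(const Box& a, const Box& b) {               // H_i U H_j as bounding box
    Box m = a;
    for (std::size_t d = 0; d < a.low.size(); ++d) {
        m.low[d]  = std::min(a.low[d],  b.low[d]);
        m.high[d] = std::max(a.high[d], b.high[d]);
    }
    return m;
}

double weighted_error(const Box& H, const std::vector<std::vector<int>>& x,
                      const std::vector<int>& y, const std::vector<double>& w) {
    double e = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (h_box(H, x[i]) != y[i]) e += w[i];
    return e;
}

// Requires a non-empty candidate set SH.
Box best_merged_box(const std::vector<Box>& SH,
                    const std::vector<std::vector<int>>& x,
                    const std::vector<int>& y, const std::vector<double>& w) {
    double eps_min = 1.0;                              // 3.1.1
    Box best = SH.front();
    for (std::size_t i = 0; i < SH.size(); ++i)        // 3.1.2: one-to-one merges
        for (std::size_t j = i + 1; j < SH.size(); ++j) {
            if (SH[i].y != SH[j].y) continue;          // merge within one class
            Box Htemp = merge(SH[i], SH[j]);
            double eH = weighted_error(Htemp, x, y, w);
            if (eH < eps_min) { eps_min = eH; best = Htemp; }
        }
    return best;                                       // 3.1.3: H_opt
}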

In order to optimise the final result, it is possible to combine the previous approaches, finding at each iteration the best weak classifier among the single threshold h_Γ, the interval h_Δ and the general hyperrectangle h_H. Step 3 of the Adaboost algorithm becomes:

3. Do for t = 1, ..., T
   3.1 Train the classifiers with respect to the weighted sample set {S, w^(t)} and obtain the three hypotheses h_Γ, h_Δ and h_H
   3.2 Calculate the weighted errors ε_Γ, ε_Δ and ε_H introduced by each classifier
   3.3 Choose h_t from {h_Γ, h_Δ, h_H} for which ε_t = min(ε_Γ, ε_Δ, ε_H)
   3.4 Estimate λ

As we will see in the results presented in the last section, this strategy minimises the number of iterations, and thus minimises the final hardware cost in most cases, even if the hardware cost of the implementation of a hyperrectangle is locally higher than that of a single threshold.

3.4 Estimation of the hyperrectangle hardware implementation cost

As the elementary structure of the hyperrectangle is based on numerous comparisons performed in parallel (Fig. 8), it is necessary to optimise the implementation of the comparator.

(Fig. 8: each feature x_d is compared in parallel with its bounds, x_d > θ_d^l and x_d < θ_d^u, and the comparison results are combined by AND gates.)