Classification and Regression Tree: Introduction to CARTs, Estimate Impurity
2. Committee Methods: Bagging, Boosting
3. Building CARTs: Splits, Construction Parameters
4. Conclusion
Ugo Jardonnet
Tree Based Models
2 / 21
Introduction to CARTs: Classification tree
[Figure: example classification tree with node labels a, b, c]
Introduction to CARTs: Classification and Regression trees
CARTs
- Binary trees
- Efficient for classification AND regression
- Expert friendly
Estimate Impurity: Estimate Node Impurity
CARTs
- Classification: Gini index ...
- Regression: variance, $\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$ ...
Bagging: Random Forest
[Figure: random forest made of several trees, with node labels a, b, c, d, e, f]
Bagging: Pro
Random forest
- Excellent accuracy
- Fast and efficient on large datasets
- Estimates which variables are important
- Offers methods for unbalanced datasets
- Does not overfit as more trees are added
Boosting
Boosted Tree
- Introduced by Freund and Schapire, 1995.
- General method for improving the accuracy of any given classifier/learner that performs better than random.
- Given a weak learner model $h$, generates a strong learner of the form $\sum_t \alpha_t h_t$.
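The weighted form of the strong learner can be sketched as below, assuming binary labels in {-1, +1} and a final decision taken as the sign of the weighted sum. The names `Sample`, `WeakLearner` and `StrongLearner` are hypothetical and only serve this illustration.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Sample = std::vector<double>;
// A weak learner maps a sample to a label in {-1, +1} (assumption of this sketch).
using WeakLearner = std::function<int(const Sample&)>;

struct StrongLearner {
    std::vector<double> alphas;         // alpha_t, one weight per weak learner
    std::vector<WeakLearner> learners;  // h_t

    // Returns the sign of sum_t alpha_t * h_t(x); ties go to +1.
    int predict(const Sample& x) const {
        double score = 0.0;
        for (std::size_t t = 0; t < learners.size(); ++t) {
            score += alphas[t] * learners[t](x);
        }
        return score >= 0.0 ? +1 : -1;
    }
};
```

Each weak learner only needs to be slightly better than random; the weights decide how much each vote counts.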
Boosting: AdaBoost
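As a rough illustration of one AdaBoost round in its textbook binary form (labels in {-1, +1}), the sketch below computes the weight of the current weak learner and re-weights the observations. The function name `adaboost_round` and its signature are assumptions made for this example; the slide content itself is not reproduced here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One AdaBoost round. `weights` must sum to 1 on entry; it is updated in
// place and re-normalised. Returns alpha_t for the current weak learner.
double adaboost_round(const std::vector<int>& labels,
                      const std::vector<int>& predictions,
                      std::vector<double>& weights) {
    // Weighted error of the weak learner on the training set.
    double error = 0.0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (predictions[i] != labels[i]) error += weights[i];
    }
    // Clamp to avoid a division by zero or log(0) for a perfect learner.
    const double eps = 1e-12;
    error = std::min(std::max(error, eps), 1.0 - eps);

    const double alpha = 0.5 * std::log((1.0 - error) / error);

    // Misclassified observations gain weight, correct ones lose weight.
    double norm = 0.0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        weights[i] *= std::exp(-alpha * labels[i] * predictions[i]);
        norm += weights[i];
    }
    for (double& w : weights) w /= norm;
    return alpha;
}
```

After the update, the misclassified observations carry half of the total weight, which forces the next weak learner to focus on them.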
Boosting: Pro
Boosting ...
- Over-fits very slowly
- Allows feature selection
- Standard for a large variety of detection and recognition applications:
  - Face detection [Viola&Jones01]
  - Face recognition [Lu06]
  - Learning from Ambiguously Labeled Images [Cour08]
  - ...
Building CARTs: Naive Split
    // For each feature i, try every observed value as a threshold,
    // then rescan all observations to partition them.
    for (std::size_t i = 0; i < features.size(); i++) {
        for (std::size_t j = 0; j < observations.size(); j++) {
            auto threshold = observations[j][i];
            for (std::size_t k = 0; k < observations.size(); k++) {
                // Compare on the same feature i that produced the threshold.
                if (observations[k][i] < threshold)
                    ...
                else
                    ...
            }
        }
    }
Listing 1: Scan the entire dataset for each splitting value
Building CARTs: Standard Split
    for (std::size_t dim = 0; dim < features.size(); dim++) {
        // Order the observations by their value on the current feature.
        std::sort(observations.begin(), observations.end(),
                  [dim](const Obs& a, const Obs& b) { return a[dim] > b[dim]; });
        // A single pass over the sorted observations visits every threshold.
        for (const auto& obs : observations) {
            ...
        }
    }
Listing 2: Sort (std::sort) on each feature
Bucketed
Building CARTs: Bucket Split
    for (std::size_t dim = 0; dim < nb_features; dim++) {
        // Pass 1: accumulate (sum of y, sum of y^2, count) in each observation's bucket.
        for (const auto& obs : observations) {
            int bucket = (obs[dim] - min[dim]) / (max[dim] - min[dim]) * (slices.size() - 1);
            slices[bucket] += {y, y * y, 1};  // y: target value of obs
        }
        // Pass 2: sweep the buckets in order, keeping running sums for the left side.
        for (const auto& current_slice : slices) {
            left_sum, left_sum2, nb_left += current_slice;  // component-wise accumulation
            double vleft = variance(left_sum, left_sum2, nb_left);
            ...
            double gain = vleft + vright;
        }
    }
Listing 3: Bucket sorting features
Building CARTs: Bucket Split
Possible if the splitting criterion is a direct function of additive sub-variables, e.g. $\mathrm{Var}(X) = E[X^2] - (E[X])^2$, where both expectations are computed from running sums.
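A self-contained version of the bucketed split for one feature might look like the sketch below. It relies on the additive statistics (sum, sum of squares, count) and the identity above. The names `Stats` and `best_bucket_split` are hypothetical, and the size-weighted variance used as the cost is an assumption of this sketch rather than the presenter's exact criterion.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Additive statistics of a set of targets: sum, sum of squares, count.
struct Stats {
    double sum, sum2, n;
    Stats& operator+=(const Stats& o) {
        sum += o.sum; sum2 += o.sum2; n += o.n;
        return *this;
    }
    // Var(X) = E[X^2] - (E[X])^2
    double variance() const {
        if (n == 0.0) return 0.0;
        const double mean = sum / n;
        return sum2 / n - mean * mean;
    }
};

// Returns the index of the last bucket sent to the left child.
// Preconditions (assumed): x is non-empty, x.size() == y.size(), nb_slices >= 2.
std::size_t best_bucket_split(const std::vector<double>& x,
                              const std::vector<double>& y,
                              std::size_t nb_slices) {
    double lo = x[0], hi = x[0];
    for (double v : x) { if (v < lo) lo = v; if (v > hi) hi = v; }

    // Pass 1 (N steps): drop each observation into its bucket.
    std::vector<Stats> slices(nb_slices);  // value-initialised to zeros
    Stats total{};
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::size_t bucket = hi > lo
            ? static_cast<std::size_t>((x[i] - lo) / (hi - lo) * (nb_slices - 1))
            : 0;
        const Stats s{y[i], y[i] * y[i], 1.0};
        slices[bucket] += s;
        total += s;
    }

    // Pass 2 (nb_slices steps): sweep the buckets with running left statistics.
    Stats left{};
    std::size_t best = 0;
    double best_cost = std::numeric_limits<double>::infinity();
    for (std::size_t b = 0; b + 1 < nb_slices; ++b) {
        left += slices[b];
        const Stats right{total.sum - left.sum, total.sum2 - left.sum2, total.n - left.n};
        if (left.n == 0.0 || right.n == 0.0) continue;  // skip empty children
        const double cost = left.n * left.variance() + right.n * right.variance();
        if (cost < best_cost) { best_cost = cost; best = b; }
    }
    return best;
}
```

No sort is needed: the right-hand statistics are obtained by subtracting the running left-hand sums from the totals, which is only possible because the criterion is built from additive quantities.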
Conclusion
Let N be the number of observations. Complexities of a split:
- Naive: nb_features × N × N
- Standard: nb_features × (N log N + N)
- Bucketed: nb_features × (N + nb_slices)
In short:
- Committee methods have very good properties.
- They rely on the fact that weak learners are indeed weak.
- Good match with CARTs and fast to construct.