Slides - Laurent Candillier

Design and Analysis of the Nomao Challenge
Active Learning in the Real-world
http://www.nomao.com/labs/challenge
[email protected], [email protected]


Outline
1. Nomao: Smartphone, Web, Labs
2. ML issues: Deduplication, Real-world
3. In-house experiments: Boosting stumps, Boosting trees
4. Nomao challenge: Protocol, Results
5. Discussion


Nomao Smartphone: Search

Nomao Smartphone: Results

Nomao Smartphone: Map

Nomao Smartphone: Spot page

Nomao Smartphone: User page

Nomao Web: Search for a bar

Nomao Web: Bar page

Nomao Web: NLP and Reco

Nomao Web: Duplicate issue

Nomao Labs: Research

[Candillier, 2011]
Topics: Information Retrieval, Recommender Systems, Machine Learning, Natural Language Processing, Graph Mining
http://www.nomao.com/labs


Deduplication: The deduplication issue

Which descriptions refer to the same spot?

ID  Name                Phone       Address                               GPS
1   La poste            3631        13 Rue De La Clef 59000 Lille France  (50.64, 3.04)
2   La poste            0320313131  13 Rue Nationale 59000 Lille France   (50.63, 3.05)
3   La poste nationale  3631        13 r. nationale 59000 lille           (50.63, 3.05)


Deduplication: The deduplication dataset

29,104 initial examples labeled by hand, 118 comparison features (a sketch of such features follows the table):

Feature               1#2   1#3   2#3
trigram(Name)         1     0.47  0.47
levenshtein(Phone)    0.3   1     0.3
levenshtein(Address)  0.78  0.52  0.74
distance(GPS)         0.99  0.99  1
label                 -1    -1    +1
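As an illustration, here is a minimal sketch of how such pair-comparison features can be computed. The exact similarity formulas Nomao used are not given in the slides; the normalized Levenshtein similarity and character-trigram Jaccard overlap below are common choices, and the field names are illustrative.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lev_sim(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1 means identical strings."""
    m = max(len(a), len(b))
    return 1.0 if m == 0 else 1.0 - levenshtein(a, b) / m

def trigram_sim(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams."""
    ta = {a.lower()[i:i + 3] for i in range(len(a) - 2)}
    tb = {b.lower()[i:i + 3] for i in range(len(b) - 2)}
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def pair_features(s1: dict, s2: dict) -> dict:
    """Comparison vector for a pair of spot descriptions (illustrative fields)."""
    return {
        "trigram(Name)": trigram_sim(s1["name"], s2["name"]),
        "levenshtein(Phone)": lev_sim(s1["phone"], s2["phone"]),
        "levenshtein(Address)": lev_sim(s1["address"], s2["address"]),
    }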


Real-world: Data distribution issue

How should the examples to label be selected: by intuition ("feelings"), at random, or by active learning? [Sarawagi and Bhamidipaty, 2002]


Real-world: Real-world issues

- representativeness of the training dataset ⇒ data distribution issue
- scalability of the proposed method ⇒ the Nomao dataset contains millions of examples
- practicability of the labeling process ⇒ purchase data labels by batches [Lemaire et al., 2007]


Boosting stumps: Minimize the margin

Active learning based on boosting [Wang et al., 2009]: select the examples closest to the margin returned by the weak learners, i.e. focus on the examples whose labels are most uncertain.
⇒ boosting of stumps [Torre et al., 2010], plus 3 methods for selecting examples (see the sketch below):
1. explore the example space: select examples at random
2. exploit boosting: select the examples closest to the margin
3. mix: random selection weighted by the margin (wmargin)
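A minimal sketch of the three selection strategies, assuming scikit-learn's AdaBoostClassifier over depth-1 trees as the "boosting of stumps" and |decision_function| as the margin; the in-house implementation is not shown in the slides, and the weighting scheme below is one plausible choice.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def select_to_label(model, X_pool, n, strategy, seed=0):
    """Pick n pool examples to label next, per the three strategies above."""
    rng = np.random.default_rng(seed)
    margin = np.abs(model.decision_function(X_pool))  # small = uncertain label
    if strategy == "random":                  # 1. explore
        return rng.choice(len(X_pool), size=n, replace=False)
    if strategy == "margin":                  # 2. exploit
        return np.argsort(margin)[:n]
    if strategy == "wmargin":                 # 3. compromise
        w = 1.0 / (margin + 1e-9)             # one plausible margin weighting
        return rng.choice(len(X_pool), size=n, replace=False, p=w / w.sum())
    raise ValueError(strategy)

# "Boosting of stumps": AdaBoost over depth-1 decision trees.
stumps = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200)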


Boosting stumps: 1st labelings

4 initial datasets:
1. init: 28,130 initial examples
2. rand: 974 examples picked at random
3. marg: 917 examples closest to the boosting margin
4. wmarg: 964 examples selected at random, weighted by the margin


Boosting stumps: Protocol for experiments


Boosting stumps: 1st results

Number of errors on each test set (rows: training set; columns: test set); the overall error is derived below:

train \ test            init (28,130)  rand (974)  marg (917)  wmarg (964)  error
initial (reference)     1006           30          505         432          6.37%
+ random (explore)      1015           29          515         438          6.44%
+ margin (exploit)      1043           33          243         234          5.01%
+ wmargin (compromise)  1062           32          248         230          5.07%

- active learning improves results
- improvements are more significant on tricky examples
- but it degrades results on the initial dataset
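The error column aggregates the errors over all four test sets; e.g. for the reference row:

error = (1006 + 30 + 505 + 432) / (28,130 + 974 + 917 + 964) = 1,973 / 30,985 ≈ 6.37%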


Boosting trees: Improve results

Same experiment with C5 boosting of decision trees [Quinlan, 1996]:

train \ test            init (28,130)  rand (974)  marg (917)  wmarg (964)  error
initial (reference)     466            10          251         266          3.20%
+ random (explore)      444            9           248         253          3.08%
+ margin (exploit)      496            11          101         129          2.38%
+ wmargin (compromise)  475            8           112         96           2.23%

- better overall results
- exploration improves results even on the initial data
- the compromise (wmargin) leads to the best results


Boosting trees: Next results

Add 2 new datasets:
- rand: 986 examples selected with "real" random sampling
- wmarg5: 995 examples selected at random, weighted by the C5 margin

Number of errors on each test set (rows: training configuration; columns: test set):

train \ test  init (29,104)  rand (986)  marg (917)  wmarg (964)  wmarg5 (995)  error
full          548            24          63          63           143           2.55%
no random     571            29          61          73           160           2.71%
no margin     540            26          85          74           160           2.68%
no wmargin    546            23          72          85           170           2.72%
no wmargin5   529            27          61          68           218           2.74%

Each actively selected dataset helps the model handle its own kind of data better.


Boosting trees: Discussion

- results can be improved with better-adapted learning machines
- results can be improved with better active learning methods
- be careful not to degrade results on the initial data



Protocol: Challenge data

- 29,104 training examples: the initial data, labeled
- 1,985 test examples: selected at random, labels not provided
- 100,000 unlabeled examples: selected at random, no labels


Protocol: Challenge protocol

- 2 active campaigns: ask for 100 new labels
- 3 test campaigns: provide labels for the test dataset
- ACC and AUC are used to evaluate the results (see the sketch below)
- goal: the best improvement thanks to active learning AND beat the baseline model
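For reference, a minimal sketch of the two evaluation metrics, assuming scikit-learn; the variable names are illustrative.

from sklearn.metrics import accuracy_score, roc_auc_score

# y_true: gold labels (+1/-1); y_pred: hard predictions; y_score: P(+1) or margin
acc = accuracy_score(y_true, y_pred)   # ACC: fraction of correct predictions
auc = roc_auc_score(y_true, y_score)   # AUC: ranking quality of the scores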


Protocol: Baseline method

Naive Bayes classifier; the 100 examples to label are picked along the P('+1' | X) axis between 0 and 1 (sketched in code below):
- A: the 10 examples with the lowest probability of belonging to class "+1"
- B: the 10 examples with the strongest probability of belonging to class "+1"
- C: 50 examples around the decision boundary at 0.5 (25 below and 25 above)
- D: 15 examples uniformly distributed between A and C
- E: 15 examples uniformly distributed between C and B
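A minimal sketch of this selection scheme, assuming scikit-learn's GaussianNB for the probability estimates (the slides do not specify the Naive Bayes variant); the helper names are illustrative.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def spread(segment: np.ndarray, k: int) -> np.ndarray:
    """k indices spread uniformly over a segment of the probability ranking."""
    return segment[np.linspace(0, len(segment) - 1, k).astype(int)]

def baseline_selection(model: GaussianNB, X_pool: np.ndarray) -> np.ndarray:
    """Indices of the 100 pool examples the baseline asks to label."""
    p = model.predict_proba(X_pool)[:, 1]   # P('+1' | X)
    order = np.argsort(p)                   # pool sorted by ascending P(+1)
    lo = np.searchsorted(p[order], 0.5)     # first example at/above 0.5
    a = order[:10]                          # A: 10 lowest-probability examples
    b = order[-10:]                         # B: 10 highest-probability examples
    c = order[lo - 25:lo + 25]              # C: 50 around the boundary (25/25)
    d = spread(order[10:lo - 25], 15)       # D: 15 uniform between A and C
    e = spread(order[lo + 25:-10], 15)      # E: 15 uniform between C and B
    return np.concatenate([a, d, c, e, b])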


Results: All results

Baseline vs. Nomao (in-house) vs. Tengyu Sun's approach. TD = test data, AD = actively selected data.

Method    Active phase  AUC on TD  Error on TD  Error on AD
Baseline  1             0.9488     19.9%        37.8%
Baseline  2             0.9786     9.6%         32.5%
Baseline  3             0.9794     9.4%         ∅
Nomao     1             0.9807     12%          32.5%
Nomao     2             0.9816     9%           45%
Nomao     3             0.9821     7.5%         ∅
T. Sun    1             0.9629     7.3%         24.4%
T. Sun    2             0.9631     7.2%         8.5%
T. Sun    3             0.9633     7.2%         ∅

- T. Sun: the best results
- Baseline & Nomao: bigger improvements through active learning
- all methods selected tricky examples


Difficulty

How difficult is the prediction on the examples asked by the participants Method AD1 (Baseline) AD2 (Baseline) AD1 (Nomao) AD2 (Nomao) AD1 (T.Sun) AD2 (T.Sun)

Baseline ∅ ∅ 37% 18.6% 29% 16.7%

Nomao 8.5% 8.6% ∅ ∅ 38.4% 9.5%

T. Sun 7.3% 9.9% 24.7% 7% ∅ ∅

Average 8.6% 21.6% 23.5%


Final results

On all test data available Method test marg wmarg wmarg5 total

Baseline 9.4% 22.9% 21.3% 33.1% 19%

Nomao 7.5% 24% 22.4% 45.8% 22%

T. Sun 7.2% 17.9% 16.5% 26.3% 15%



Discussion: Real-world issues

- the initial distribution is biased ⇒ predicting labels on randomly selected examples is not trivial
- it is even more difficult on actively selected examples ⇒ imprecise addresses, shops in malls, doctors' surgeries or post offices
- growing from 29,104 to 34,465 examples ⇒ C5 error < 3%


Discussion: Final results of C5

Number of errors per test set (rows: test set; columns: training configuration, where "no X" leaves dataset X out of training):

test \ train    full   no random  no margin  no wmargin  no wmargin5  no baseline  no nomao  no tsun
init (29,104)   568    572        547        570         525          570          577       564
rand (1,985)    108    115        108        110         105          109          107       105
marg (917)      61     59         84         69          73           55           54        57
wmarg (964)     65     65         85         79          74           65           64        61
wmarg5 (995)    152    156        163        167         269          155          148       149
baseline (163)  11     11         11         12          12           13           10        11
nomao (167)     22     21         24         26          27           24           24        22
tsun (170)      26     25         27         25          28           27           27        26
error           2.94%  2.97%      3.04%      3.07%       3.23%        2.95%        2.93%     2.89%

- wmarg5 is the best dataset for improving C5 results
- 269 / 995 = 27% error on wmarg5 when it is not used for training


Discussion: Model-dependence

The relevance of the active learning process is model-dependent; e.g. wmarg5 is the most appropriate data for C5.

C5 (init)           240
C5 (init+nomao)     148
C5 (init+baseline)  155
C5 (init+tsun)      185


Discussion

Next: tests with the machine learning method of the winner, Tengyu Sun: AdaBoost. The data are available on the UCI Machine Learning Repository [Frank and Asuncion, 2010]. Here we are!


Discussion: Participation

- 12 registrations (Stanford, Carnegie Mellon...)
- 1 participant (Tsinghua University)
- Why were there so few participants? Lack of communication? Timing issue? Problem interest? Real-world issues?


Congratulations


References

Candillier, L. (2011). Nomao: la recherche géolocalisée personnalisée. In Zighed, D. A. and Venturini, G., editors, 11ème Conférence Internationale Francophone sur l'Extraction et la Gestion des Connaissances (EGC), volume 1, pages 259–261.

Frank, A. and Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml

Lemaire, V., Bondu, A., and Clérot, F. (2007). Purchase of data labels by batches: study of the impact on the planning of two active learning strategies. Technical report, Orange Labs. http://perso.rd.francetelecom.fr/lemaire/publis/iconip_2007_camera_ready.pdf

Quinlan, R. (1996). Bagging, boosting and C4.5. In 13th National Conference on Artificial Intelligence, pages 725–730.

Sarawagi, S. and Bhamidipaty, A. (2002). Interactive deduplication using active learning. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269–278.

Torre, F., Faddoul, J.-B., Chidlovskii, B., and Gilleron, R. (2010). Boosting multi-task weak learners with applications to textual and social data. In 9th International Conference on Machine Learning and Applications (ICMLA), pages 367–372.

Wang, Z., Song, Y., and Zhang, C. (2009). Efficient active learning with boosting. In Proceedings of the 9th SIAM International Conference on Data Mining, pages 1232–1243.