OptiCat Multivariate Optimization Platform - Dr. David Farrusseng


David Farrusseng, Institut de Recherches sur la Catalyse, CNRS, Villeurbanne, France

Objectives

- Tutorial on Evolutionary Algorithms
  - In the frame of catalyst library design
  - From chemists to chemists
- Training on OptiCat
  - To design a new library from the results of the last library
  - Encoding issue (description of variables)
  - To validate algorithm settings using simulation mode
- Questions?

Legal issues

- OptiCat can be downloaded at http://a.farrusseng.free.fr
- Open-source software: CeCILL license
- CNRS owns the code
- PhD student: Frédéric Clerc
  - Computer science supervisor: Rico Rakotomalala (ERIC laboratory)
  - Chemoinformatics supervisor: David Farrusseng (IRC-CNRS)

Principles of Evolutionary Algorithm

Rational Library Design

- Goals
  - To find the global optimum with a minimum of experiments
  - To find local optima
- Requirements
  - With high confidence
  - Iterative process
  - Handle library series
- Methods
  - Simulated Annealing
  - Tabu
  - Evolutionary Strategy
  - Genetic Algorithm

[Figure: yield response surface plotted against variable 1 and variable 2]

Evolutionary Steps

- Encoding
  - Variables: 7% Pd; 11% Pt; 82% TiO2
  - Chromosome: 0 0 0 7 0 11 0 0 0 0 82 0
- Data processing cycle: Population, Evaluation, Selection, Mating, Cross-over, Mutation
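The cycle of steps listed on this slide can be sketched as a short loop. This is only an illustrative sketch, not OptiCat's implementation: the function name `evolve`, the "keep the better half" selection rule, the one-point crossover and all default parameter values are assumptions chosen for brevity.

```python
import random

def evolve(fitness, n_genes, pop_size=20, generations=10, mutation_rate=0.1, seed=0):
    """Hypothetical minimal loop: population, evaluation, selection, mating, crossover, mutation."""
    rng = random.Random(seed)
    # Population: random real-valued chromosomes with genes in [0, 1).
    population = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Evaluation and selection: keep the better half as parents (assumed rule).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            # Mating: draw two distinct parents at random.
            p1, p2 = rng.sample(parents, 2)
            # Cross-over: one cut point, head of p1 joined to tail of p2.
            cut = rng.randint(1, n_genes - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation: redraw each gene with a small probability.
            child = [rng.random() if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        # Remember the best individual seen so far.
        best = max(population + [best], key=fitness)
    return best

# Toy objective with its maximum at 0.5 for every gene.
toy_fitness = lambda chrom: -sum((g - 0.5) ** 2 for g in chrom)
best_found = evolve(toy_fitness, n_genes=4)
```

In a real screening campaign the call to `fitness` would be replaced by the experimental evaluation of each library.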

How do EAs perform?

- Evolution of the population in the search space
- Evolution of the population performances

[Figures: population positions in the search space (axes X1 to Xn); performance (scale 0 to 8) versus generation number (0 to 50)]

Case study 1

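Variable description

The optimal composition of a catalytic material [for the ODHP, the oxidative dehydrogenation of propane] is to be found from 8 potential components: V, Mg, B, Mo, La, Mn, Fe and Ga. The sample can be synthesized by 2 preparation methods.

Example chromosome (8 composition genes followed by 1 preparation-method gene):

0.12 0.23 0.65 0.02 0.38 0.19 0.55 0.83 0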

Evolution of the performances (Y)

- 1st generation results
- All 4 generations
  - Objective function
  - Selection

[Figure: propene selectivity S(C3H6) / % (0 to 100) versus propane conversion X(C3H8) / % (0 to 20), one series per generation (1st to 4th)]

Evolution in the search space (X)

- 1st generation
- 4th generation

Benchmark definition

The optimal composition of a catalytic material [for the ODHP] is to be found from 8 potential components: V, Mg, B, Mo, La, Mn, Fe and Ga. Usually, the relationship between composition and catalytic performance is unknown before screening. For illustration, a quantitative hypothetical relationship between yield and composition of the catalytic material has been set up to simulate problems of real catalyst optimization, including phenomena such as synergism of components, poisoning and influence of the preparation method (Eq. (4)), and to estimate the optimal control parameters of our evolutionary concept. The objective function (OF) to be maximized (catalyst performance P expressed by yield, which is the product of selectivity S and conversion X) is described as follows:

$$
\mathrm{Yield} =
\begin{cases}
X_1 \cdot S_1 & \text{if impregnation} \\
X_2 \cdot S_2 & \text{if coprecipitation} \\
0 & \text{if La or B are present}
\end{cases}
$$

$$
\begin{aligned}
S_1 &= 66\,x_{V}\,x_{Mg}\,(1 - x_{V} - x_{Mg}) + 2\,x_{Mo} - 0.1\,x_{Mn} - 0.1\,x_{Fe} \\
X_1 &= 66\,x_{V}\,x_{Mg}\,(1 - x_{V} - x_{Mg}) - 0.1\,x_{Mo} + 1.5\,x_{Mn} + 1.5\,x_{Fe} \\
S_2 &= 60\,x_{V}\,x_{Mg}\,(1 - 1.3\,x_{V} - x_{Mg}) \\
X_2 &= 60\,x_{V}\,x_{Mg}\,(1 - 1.3\,x_{V} - x_{Mg})
\end{aligned}
$$
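The benchmark above is simple enough to be written as a function for simulation runs. The following is a sketch under stated assumptions: the function name `benchmark_yield`, the dictionary input and the method strings are hypothetical, the compositions are taken as mass fractions, and the "La or B present" case is assumed to take precedence over the two preparation methods.

```python
def benchmark_yield(x, method):
    """Hypothetical helper: yield (%) of the benchmark for mass fractions x and a preparation method."""
    # Poisoning: any La or B gives zero yield (assumed to override the other cases).
    if x.get("La", 0.0) > 0 or x.get("B", 0.0) > 0:
        return 0.0
    v, mg = x.get("V", 0.0), x.get("Mg", 0.0)
    mo, mn, fe = x.get("Mo", 0.0), x.get("Mn", 0.0), x.get("Fe", 0.0)
    if method == "impregnation":
        base = 66 * v * mg * (1 - v - mg)
        selectivity = base + 2 * mo - 0.1 * mn - 0.1 * fe       # S1
        conversion = base - 0.1 * mo + 1.5 * mn + 1.5 * fe      # X1
    elif method == "coprecipitation":
        selectivity = conversion = 60 * v * mg * (1 - 1.3 * v - mg)  # S2 = X2
    else:
        raise ValueError(f"unknown preparation method: {method}")
    return conversion * selectivity

# Composition close to the reported optimum (Mn and Fe split evenly as an assumption).
near_optimum = {"V": 0.32, "Mg": 0.32, "Mo": 0.26, "Mn": 0.05, "Fe": 0.05}
y_opt = benchmark_yield(near_optimum, "impregnation")
```

With these rounded fractions the function returns a yield of about 7.5, consistent with the global maximum of 7.55% quoted on the benchmark description slide.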

Benchmark description

- Search space
  - 8 continuous variables (wt%): V, Mg, B, Mo, La, Mn, Fe, Ga
  - 1 categorical variable: preparation method (coprecipitation/impregnation)
- Non-linear function as response surface
- Global maximum
  - Yield = 7.55%
  - xV = 0.32, xMg = 0.32, xMo = 0.26 and xMn + xFe = 0.10 for impregnation
  - 12% probability to find Y > 7.2%

[Figure: response surface over the V and Mg contents, with the region where La or B is present]

Evolutionary Strategy (1)

- Encoding for Evolutionary Strategy
  - Chromosome design is saved as a txt file with the .oct extension
  - ES_AppliedCata_coding.exe
- Evolutionary Strategy model
  - Algorithm design is saved as a txt file with the .opt extension
  - Real mode, design (manual evaluation): 2a-ES_AppliedCata_Model_RealMode
  - Real mode, run: 2b-ES_AppliedCata_Model_RealMode-RUN
  - Simulation mode (with Excel VBA macro): 2c- ES_AppliedCata_Model_SimulationMode

Evolutionary Strategy (2)

Inefficient encoding!

- Crossover in Evolutionary Strategy

  Parent 1:  0.12  0.23  0.65  0.02  0.38 | 0.19  0.55  0.83  0
  Parent 2:  0.43  0.04  0.71  0.34  0.22 | 0.65  0.72  0.05  1
  Child 1:   0.12  0.23  0.65  0.02  0.38 | 0.65  0.72  0.05  1
  Child 2:   0.43  0.04  0.71  0.34  0.22 | 0.19  0.55  0.83  0

  Impossible to set a variable to 0

- Mutation in Evolutionary Strategy

  Before:  0.43  0.04  0.71  0.34  0.22  0.19  0.55  0.83  0
  After:   0.43  0.04  0.71  0.98  0.22  0.19  0.55  0.83  0

  Hardly possible to set a variable to 0
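The crossover and mutation examples of this slide can be reproduced with a few lines of code. This is a sketch only; the helper names `one_point_crossover` and `mutate_gene` are hypothetical, and the mutation redraws the gene uniformly in [0, 1), which is an assumption about the operator.

```python
import random

def one_point_crossover(p1, p2, cut):
    """Exchange the tails of two real-valued chromosomes after position `cut`."""
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate_gene(chromosome, index, rng):
    """Replace one gene by a new random value (assumed uniform in [0, 1))."""
    mutated = list(chromosome)
    mutated[index] = round(rng.random(), 2)
    return mutated

parent1 = [0.12, 0.23, 0.65, 0.02, 0.38, 0.19, 0.55, 0.83, 0]
parent2 = [0.43, 0.04, 0.71, 0.34, 0.22, 0.65, 0.72, 0.05, 1]
# Cut after the 5th gene, as in the slide example.
child1, child2 = one_point_crossover(parent1, parent2, cut=5)
mutant = mutate_gene(child2, index=3, rng=random.Random(1))
```

Crossover only recombines values already present in the parents, and a uniform redraw almost never returns exactly 0, which illustrates why this direct encoding struggles to remove an element from a catalyst.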

Evolutionary Strategy (3)

- Presence/absence encoding for Evolutionary Strategy (2d-ES_coding)

  Element         V     Mg     B      Mo    La    Mn     Fe     Ga
  Presence bit    0     1      1      0     0     1      1      1
  Amount gene     0.12  0.23   0.65   0.02  0.38  0.19   0.55   0.83
  Bit x amount    0     0.23   0.65   0     0     0.19   0.55   0.83
  Composition     0%    9.4%   26.5%  0%    0%    7.7%   22.4%  33.9%

- The catalyst composition is calculated by normalizing the "bit x amount" row to 100%
- Evolutionary Strategy model with presence/absence
  - Simulation mode (with Excel VBA macro): 2e-Baerns_ES_simulation_Excel
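The decoding of a presence/absence chromosome into a composition can be sketched as follows. The function name `decode` is hypothetical, and the presence bits used in the example (including a bit of 1 for Ga) are read back from the composition row of the slide, so they are an assumption about the garbled figure.

```python
ELEMENTS = ["V", "Mg", "B", "Mo", "La", "Mn", "Fe", "Ga"]

def decode(presence, amounts):
    """Multiply each amount gene by its presence bit, then normalize to 100%."""
    raw = [bit * amount for bit, amount in zip(presence, amounts)]
    total = sum(raw)
    if total == 0:
        # No element present: return an all-zero composition.
        return {element: 0.0 for element in ELEMENTS}
    return {element: 100.0 * value / total for element, value in zip(ELEMENTS, raw)}

presence = [0, 1, 1, 0, 0, 1, 1, 1]
amounts = [0.12, 0.23, 0.65, 0.02, 0.38, 0.19, 0.55, 0.83]
composition = decode(presence, amounts)
```

A single bit flip is now enough to remove or add an element, which the direct real-valued encoding could hardly achieve.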

Genetic Algorithm

- Presence/absence encoding for Genetic Algorithm: 3-Baerns_GA_coding
- Genetic Algorithm model, simulation mode (with Excel VBA macro): 4-Baerns_GA_simulation_Excel

Batch Mode for statistical assessment

- Automated evaluation: simulation with benchmarks
  - VBA Excel
  - MatLab
  - dll
- Batch mode for multiple runs
  - Settings
  - Statistical assessment of algorithms

Baerns: Results of the virtual screening

- Effect of the population size

[Figure: three panels for population sizes 25, 50 and 100]

OptiCat: Results of the virtual screening

Evolutionary Strategy: statistics on 25 runs

- Non-optimized parameters
- Dynamic mutation and crossover rates

[Figure: three panels for population sizes 25, 50 and 100; y-axis from -1 to 8, x-axis Var4 from 1 to 28]

Strong effect of the population size on the algorithm reliability

OptiCat: Results of the virtual screening

Genetic Algorithm: statistics on 25 runs

- Constant mutation and crossover rates
- Dynamic mutation and crossover rates

[Figure: two rows of three panels, one row per rate setting, for population sizes 25, 50 and 100; y-axis from -1 to 8, x-axis Var4 from 1 to 28]

Algorithm design

- Good balance between:
  - browsing / exploitation
  - diversity / elitism
- Algorithm design is case-dependent
  - Shape of the response surface
  - Noise, outliers
- Generic vs specific algorithms
  - Evolutionary or Genetic Algorithm
  - Chromosome design
  - Selection types & parameters
  - Cross-over, mutation type & parameters
  - Population size (fixed by facilities)
  - Iteration number (cost)

Selection for monitoring the selective pressure

- Threshold
- Ranking
- Wheel
- Tournament
- Options
  - Elitism

[Figure: the four selection schemes applied to individuals A to H: threshold cut at 40%, ranking, wheel, and tournaments (D, F, A: A wins; B, H, D: B wins; H, F, D: D wins)]





Selection options

- Dataset on which the selection is carried out
  - Last population
  - All accumulated populations (outliers!)
- Elitism
  - The n best individuals are selected
- Size of population
  - Must be set at "same" (unless meta-modelling is used)

Threshold

- 0 < threshold parameter < 1

  Individual   A    B    C    D    E    F    G    H
  Fitness      25   35   50   10   15   7    95   5

- Threshold parameter = 0.4
- Number of individuals in the population x threshold parameter = 8 x 0.4 = 3.2
- The 3 best individuals are selected (G, C, B)

[Figure: individuals ranked by fitness; G, C and B lie above the 40% cut, A, E, D, F and H below]
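The threshold rule of this slide can be sketched in a few lines. The function name `threshold_selection` is hypothetical, and rounding the product down to an integer is an assumption that matches the slide's "3.2 gives 3" example.

```python
import math

def threshold_selection(fitness, threshold):
    """Keep the best floor(N x threshold) individuals of the population."""
    n_selected = math.floor(len(fitness) * threshold)
    ranked = sorted(fitness, key=fitness.get, reverse=True)
    return ranked[:n_selected]

population_fitness = {"A": 25, "B": 35, "C": 50, "D": 10, "E": 15, "F": 7, "G": 95, "H": 5}
selected = threshold_selection(population_fitness, 0.4)
```

With the fitness values of the slide, the three individuals kept are G, C and B.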

Wheel

- 0 < selective pressure < 2 (?)
- 1 is the setting generally used for GA
- The higher, the more elitist
- Outliers!

  Individual    A     B     C     D    E    F    G     H
  Fitness       25    35    50    10   15   7    95    5
  Probability   10%   14%   21%   4%   6%   3%   39%   2%

[Figure: roulette wheel with one sector per individual, sized by selection probability]

Ranking

- 0 < selective pressure < 2 (?)
- 1 is the setting generally used for GA
- The higher, the more elitist

  Individual    A     B     C     D    E     F    G     H
  Fitness       25    35    50    10   15    7    95    5
  Rank          5     6     7     3    4     2    8     1
  Probability   14%   17%   19%   8%   11%   6%   22%   3%

[Figure: wheel with one sector per individual, sized by rank-based probability]
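The probabilities in the wheel and ranking tables can be recomputed as fitness divided by total fitness, and rank divided by the sum of ranks. The sketch below assumes exactly these two proportional forms; the function names are hypothetical, and the selective-pressure parameter mentioned on the slides is not modelled because its formula is not given.

```python
def wheel_probabilities(fitness):
    """Selection probability proportional to fitness (roulette wheel)."""
    total = sum(fitness.values())
    return {name: value / total for name, value in fitness.items()}

def ranking_probabilities(fitness):
    """Selection probability proportional to rank (worst individual has rank 1)."""
    ordered = sorted(fitness, key=fitness.get)
    ranks = {name: position + 1 for position, name in enumerate(ordered)}
    total = sum(ranks.values())
    return {name: rank / total for name, rank in ranks.items()}

scores = {"A": 25, "B": 35, "C": 50, "D": 10, "E": 15, "F": 7, "G": 95, "H": 5}
wheel = wheel_probabilities(scores)
ranking = ranking_probabilities(scores)
```

The comparison shows why the wheel is sensitive to outliers: G takes about 39% of the wheel but only about 22% under ranking.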

Tournament

- 0 < number of individuals < population size
- 2 to 3 are the settings generally used for ES
- The higher, the more elitist

  Individual   A    B    C    D    E    F    G    H
  Fitness      25   35   50   10   15   7    95   5

Example with 3 individuals in the tournament:

  Contestants   Winner
  D, F, A       A
  B, H, D       B
  H, F, D       D
  ...           ...
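A tournament draws a few individuals at random and keeps the fittest. The sketch below uses hypothetical helper names (`tournament_winner`, `tournament_selection`) and draws contestants without replacement, which is an assumption.

```python
import random

def tournament_winner(contestants, fitness):
    """Return the fittest individual among the contestants."""
    return max(contestants, key=fitness.get)

def tournament_selection(fitness, size, rng):
    """Draw `size` distinct individuals at random and return the winner."""
    contestants = rng.sample(sorted(fitness), size)
    return tournament_winner(contestants, fitness)

fitness_table = {"A": 25, "B": 35, "C": 50, "D": 10, "E": 15, "F": 7, "G": 95, "H": 5}
winners = [tournament_winner(group, fitness_table)
           for group in (["D", "F", "A"], ["B", "H", "D"], ["H", "F", "D"])]
random_winner = tournament_selection(fitness_table, size=3, rng=random.Random(0))
```

The three fixed groups reproduce the winners of the slide (A, B, D). The larger the tournament, the more often the best individual G is drawn and wins, which is why a large tournament size is more elitist.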

Selection types

Ordered from elitism to diversity:

- Tournament with 12 (population size = 25)
- Threshold
- Wheel (outliers)
- Rank
- Tournament with 2

- The higher the diversity
  - The higher the browsing / exploitation ratio
  - The greater the chance of reaching the global maximum
  - The lower the convergence speed
- The higher the elitism
  - The lower the browsing / exploitation ratio
  - The greater the chance of getting trapped at a local maximum
  - The higher the convergence speed

Crossover types

- 1-point
- Multipoint
- Uniform
- Options
  - Fixed crossover probabilities: 0.6 to 0.8 are the settings generally used for ES
  - Dynamic crossover probabilities: 1 - (Pmean / Pbest) is used in Baerns ES

Uniform crossover

- 0 < portion of chromosome exchanged < 1
- Can operate on
  - Encoded genes (GA)
  - Un-encoded genes (ES)
- 0.4 to 0.6 are the settings generally used for ES
- The higher, the more diversity

[Figure: example with value = 0.5]

Multipoint crossover

- 0 < number of points < number of genes
- Can operate on
  - Encoded genes (GA)
  - Un-encoded genes (ES)
- 1 to 3 are the settings generally used for GA
- The higher, the more diversity

[Figure: example with value = 2]
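Both crossover types can be sketched on plain lists, which covers encoded (bit) and un-encoded (real-valued) genes alike. The function names are hypothetical, and reading the "portion of chromosome exchanged" as a per-gene swap probability is an assumption.

```python
import random

def uniform_crossover(p1, p2, portion, rng):
    """Swap each gene between the two parents with probability `portion`."""
    c1, c2 = list(p1), list(p2)
    for i in range(len(p1)):
        if rng.random() < portion:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

def multipoint_crossover(p1, p2, points):
    """Alternate the source parent at each cut point."""
    c1, c2 = [], []
    swap, previous = False, 0
    for cut in sorted(points) + [len(p1)]:
        segment1, segment2 = p1[previous:cut], p2[previous:cut]
        if swap:
            segment1, segment2 = segment2, segment1
        c1 += segment1
        c2 += segment2
        swap, previous = not swap, cut
    return c1, c2

zeros, ones = [0] * 8, [1] * 8
two_point = multipoint_crossover(zeros, ones, points=[2, 5])
half_swapped = uniform_crossover(zeros, ones, portion=0.5, rng=random.Random(0))
```

With two cut points after genes 2 and 5, the first child keeps the outer segments of the first parent and takes the middle segment from the second.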

Genetic Algorithm vs. Evolutionary Strategy

- Crossover in Evolutionary Strategy
  - Variable values are exchanged
- Crossover in Genetic Algorithm
  - Variable values are exchanged
  - Values are "mutated" at the cross-over points

Mutation rate is important in ES

[Figure: example parent and child chromosomes (gene values between 0 and 7) for ES and GA crossover]

Mutation types

- Bit-flip
- ES
- Options
  - Fixed mutation probabilities: 0.01 to 0.05 are the settings generally used for GA
  - Dynamic mutation probabilities: (Pmean / Pbest) is used in Baerns ES

Bit-flip mutation

- 0 < mutation rate < 1
- The higher the mutation rate
  - The more diverse
  - The higher the browsing / exploitation ratio
  - The lower the convergence speed
- Changes a bit in encoded chromosomes (GA)
- Can operate on un-encoded chromosomes (ES)
  - Changes a variable value

[Figure: example chromosome before and after mutation (gene values 0 7 2 1 0 7 7 1)]

ES mutation

- 0 < mutation rate < 1
- The higher the mutation rate
  - The more diverse
  - The higher the browsing / exploitation ratio
  - The lower the convergence speed
- Changes a gene in un-encoded chromosomes (ES)

[Figure: example chromosome before and after mutation (gene values 0 7 2 1 0 7 7 1)]
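The two mutation operators differ only in what a "gene" is. The sketch below uses hypothetical function names; flipping each bit independently with the mutation rate, and redrawing a real-valued gene uniformly within assumed bounds, are both assumptions about the operators.

```python
import random

def bit_flip_mutation(bits, rate, rng):
    """GA: flip each bit of an encoded chromosome with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]

def es_mutation(values, rate, rng, low=0.0, high=1.0):
    """ES: redraw each un-encoded gene within [low, high] with probability `rate`."""
    return [rng.uniform(low, high) if rng.random() < rate else value for value in values]

encoded = [0, 1, 1, 0, 0, 1, 1, 1]
unencoded = [0.43, 0.04, 0.71, 0.34]
flipped = bit_flip_mutation(encoded, rate=0.05, rng=random.Random(0))
perturbed = es_mutation(unencoded, rate=0.05, rng=random.Random(0))
```

A rate of 0 leaves the chromosome untouched and a rate of 1 changes every gene, so the rate directly sets how much diversity is injected per generation.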

Impact of noisy experimental data

- Two algorithms
  - Elitist (tournament with 4, the best always selected)
  - Normal (tournament with 2)
- Benchmark
  - Baerns
  - Baerns modified with 5% of experimental error and 1% of false positives
- Assessment criteria
  - Robustness: % of runs which reach 90% of the global maximum at the 10th generation
  - Convergence speed: mean of the best catalysts at the 5th generation
  - Value of the confidence interval at generation 5

How to add noise & outliers?

- Modification of a mathematical function in VBA
  - To add noise to the published benchmark
  - Baerns example
- Evaluation node with dll

[Screenshot: settings with the displayed values 0.01, 0.05 and 10. One field is annotated "Typical range: 0.01-0.1", another "Typical range: 0.01-0.05, for either false positive or false negative"]
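One possible way to perturb a benchmark value is sketched below. The slides do this in VBA or through a dll; this Python version is only an illustration, and the function name `noisy_evaluation`, the relative Gaussian noise model, the fixed false-positive value and all default parameters are assumptions.

```python
import random

def noisy_evaluation(true_value, rng, noise=0.05, outlier_rate=0.01, outlier_value=10.0):
    """Return a perturbed measurement of `true_value`.

    noise:         relative standard deviation of the experimental error (assumed Gaussian)
    outlier_rate:  probability of reporting a false positive instead of the measurement
    outlier_value: value reported for a false positive (hypothetical choice)
    """
    if rng.random() < outlier_rate:
        return outlier_value
    return true_value * (1.0 + rng.gauss(0.0, noise))

rng = random.Random(42)
measurements = [noisy_evaluation(7.5, rng) for _ in range(1000)]
mean_measurement = sum(measurements) / len(measurements)
```

Wrapping the benchmark function with such a perturbation makes it possible to test whether a selection scheme keeps chasing a single false positive or recovers from it.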

Results (1)

[Figure: mean plots with whiskers (mean ± 0.95 confidence interval) for normal selection and elitism selection, each without noise and with noise; y-axis from 0 to 8, x-axis Var4 from 1 to 28. Plot titles: elitist_nonoise (10v*750c), elitism_noise (10v*750c) and browsing_noise (10v*810c) in Workbook1.stw]

Results (2)

[Figure: grid of nine panels, 50 runs; yield (%) from 0 to 8 versus generation number from 1 to 25. No noise, no outliers: panels 9S, 16S, 11S. 5% noise, 1% outliers: panels 9N, 16N, 11N. 10% noise, 5% outliers: panels 9vN, 16vN, 11vN]

Effects of GA parameters on the optimization speed

  Parameter     Levels
  Elitism       yes, no
  Pop. size     24, 40, 56
  Cross-over    uniform, 1-point, 3-point
  Selection     ranking, threshold, wheel, tournament

Conditions compared: smooth, noisy, very noisy

CO oxidation case study

Case study description

- Supported bimetallic systems
  - 1 noble metal (NM) among Pt, Au, Cu
  - Loading: 0.1-2.1 wt%
  - 1 transition metal (TM) among Mo, Nb, V
  - Loading: 1-5 wt%
  - 1 support among CeO2, TiO2, ZrO2
- Testing conditions
  - 3 temperatures
  - 200, 250, 300°C

[Figure: CO conversion versus composition for ternary systems; TM (Mo, Nb, V), NM (Au, Cu, Pt) and support (TiO2, ZrO2, CeO2)]

Chromosome encoding

- Evolutionary Strategy
- Genetic Algorithm

In the present case, the string is composed of 24 bits. The 4 discrete variables (temperature, support, TM and NM types) were encoded in genes of 4 bits each. As indicated previously, they can take 3 different modalities: Cu, Au and Pt for NM; V, Mo and Nb for TM; Ce, Ti and Zr for the support; and T1, T2 and T3 for the tested temperature. The 2 continuous composition variables were also encoded as genes of 4 bits, resulting in 16 steps of 0.06% and 0.25% for the NM (0.1-1.1%) and TM (1-5%) concentrations, respectively. Therefore a subspace defined by a ternary at a given temperature encompasses 16² catalysts and the whole search space more than 20,000 experiments (3⁴ x 16²).

[Figure: string representation of the 1% Cu, 3% V, Zr, 200°C catalyst solution]
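The numbers in this paragraph can be checked with a short calculation: 4 discrete variables with 3 modalities each and 2 continuous variables with 16 levels each. The sketch below also shows one possible 4-bit quantization of a continuous gene; the function name `quantize` and the choice of step = range / 16 (which reproduces the 0.25% step for TM) are assumptions, and the bit codes used by OptiCat for the discrete modalities are not given in the source.

```python
N_BITS = 4

def quantize(value, low, high, n_bits=N_BITS):
    """Map a continuous value to a 4-bit string, using 2**n_bits steps of equal width."""
    levels = 2 ** n_bits
    step = (high - low) / levels
    # Clip to the valid level range so that value == high maps to the last level.
    index = min(levels - 1, max(0, int((value - low) / step)))
    return format(index, f"0{n_bits}b")

chromosome_bits = 6 * N_BITS            # 4 discrete + 2 continuous genes of 4 bits
subspace_size = 16 ** 2                 # NM loading x TM loading for one ternary and temperature
search_space_size = 3 ** 4 * 16 ** 2    # temperature, support, TM type, NM type x loadings

tm_gene = quantize(3.0, low=1.0, high=5.0)   # 3 wt% transition metal
```

The whole search space therefore contains 20,736 candidate experiments, in line with the "more than 20,000" quoted above.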

Objective function

- Single objective = easy
- Yield is the objective function
  - Y = Conversion x Selectivity
  - Conversion & selectivity have the same weight and are linear
- Concept of desirability
  - Multi-objective
  - To build a single indicator of the performance

$$
D_{Conv} =
\begin{cases}
0.0105 \times X(CO) & \text{if } X(CO) < 95\% \\
1 & \text{if } X(CO) \geq 95\%
\end{cases}
$$

$$
D_{Sel} = 0.001 \times S(CO_2) + 0.9
$$

$$
D_{Temp} = -0.002 \times T + 1.4
$$

$$
Y = D_{Conv} \times D_{Sel} \times D_{Temp}
$$

[Figure: desirability d (0 to 1) versus conversion (%), selectivity (%) and temperature (°C)]
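The desirability approach combines the three criteria into one number by multiplication. The sketch below follows the formulas of the slide, with one assumption: the temperature term is written as a function of the temperature T in °C (the extracted slide text names S(CO2) in that formula, which is taken here as a slip, since the corresponding plot has a temperature axis). The function names are hypothetical.

```python
def d_conv(x_co):
    """Desirability of the CO conversion (%): linear up to 95%, then 1."""
    return 1.0 if x_co >= 95 else 0.0105 * x_co

def d_sel(s_co2):
    """Desirability of the CO2 selectivity (%)."""
    return 0.001 * s_co2 + 0.9

def d_temp(temperature):
    """Desirability of the test temperature (°C); assumed to depend on T."""
    return -0.002 * temperature + 1.4

def desirability(x_co, s_co2, temperature):
    """Overall objective Y = DConv x DSel x DTemp."""
    return d_conv(x_co) * d_sel(s_co2) * d_temp(temperature)

ideal = desirability(100, 100, 200)     # full conversion, fully selective, lowest temperature
moderate = desirability(50, 100, 300)   # half conversion at the highest temperature
```

Under these formulas a catalyst reaching full conversion at 200°C scores about 1, while the same selectivity with 50% conversion at 300°C scores about 0.42, so low-temperature activity is rewarded.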

Optimization of Solid Catalyst Using Diversity

Two different diversities

- Diversity based on library fitness (Y)
  - Diversity indicator: Y(mean)/Y(best), std, ...
  - Monitoring of crossover & mutation rates
  - When Y(mean)/Y(best) → 1, the algorithm is converging: the mutation rate shall increase to escape from a local maximum
- Diversity based on catalyst description (X)
  - Diversity indicator: to set rules to compute the distance between catalysts
  - To discard/allow new catalysts according to distances
  - Examples: sharing, Tabu, simulated annealing

Baerns: Diversity based on library fitness (Y)

Monitoring crossover & mutation rates

- Crossover rate: 1 - (Pmean / Pbest) is used in Baerns ES
- Mutation rate: (Pmean / Pbest) is used in Baerns ES

Effect of dynamic parameters in ES

- Setting 1: Crossover = 1 - (Pmean / Pbest), Mutation = Pmean / Pbest
- Setting 2: Crossover = 1 - (0.25 x Pmean / Pbest), Mutation = 0.25 x Pmean / Pbest

[Figure: for each setting, mean plots with whiskers (mean ± 0.95 confidence interval) of the y-axis value (0 to 8) and of the crossover and mutation rates (0 to 1) versus Var4 (1 to 28)]
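The dynamic rates quoted for the Baerns ES depend only on the mean and best performance of the current population. A minimal sketch, assuming positive performances and a hypothetical function name `dynamic_rates`; the `scale` argument reproduces the 0.25 factor of the second setting.

```python
def dynamic_rates(performances, scale=1.0):
    """Return (crossover_rate, mutation_rate) from the population performances.

    mutation  = scale x Pmean / Pbest
    crossover = 1 - mutation
    """
    p_best = max(performances)
    if p_best <= 0:
        raise ValueError("the best performance must be positive")
    p_mean = sum(performances) / len(performances)
    mutation = scale * p_mean / p_best
    return 1.0 - mutation, mutation

population_performances = [25, 35, 50, 10, 15, 7, 95, 5]
crossover_1, mutation_1 = dynamic_rates(population_performances)
crossover_2, mutation_2 = dynamic_rates(population_performances, scale=0.25)
```

As the population converges, Pmean approaches Pbest, so the mutation rate rises towards `scale` and the crossover rate falls, which is the intended escape mechanism from a local maximum.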

Two different diversities

- Diversity based on library fitness (Y)
  - Diversity indicator: Y(mean)/Y(best), ...
  - Monitoring of crossover & mutation rates
  - When Y(mean)/Y(best) → 1, the algorithm is converging: the mutation rate shall increase to escape from a local maximum
- Diversity based on catalyst description (X)
  - Diversity indicator: to set rules to compute the distance between catalysts
  - Assuming that similar catalysts have similar performances
  - To discard/allow new catalysts according to distances
  - Examples: sharing, Tabu, simulated annealing

How does Tabu perform?

- Evolution of the population in the search space
- Browsing: random sample generation
- Optimization:
  - Selection of the best sample
  - Decreasing population diversity

Diversity management of populations

Tanimoto index for bit string descriptors

- Presence/absence of elements is more significant than the variation of wt%
- Bit string encoding of catalysts: one bit per element (V, Mg, B, Mo, La, Mn, Fe, Ga) for catalysts #A and #B

$$
T(A,B) = \frac{N_{A \& B}}{N_A + N_B - N_{A \& B}} = \frac{2}{3 + 3 - 2} = 0.5
$$

- Monitoring the rate of diversity

[Figure: Tanimoto index (0 to 1) versus sample number (0 to 400) for three curves labelled 1/ln, 1/t and lin]
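The Tanimoto index of two presence/absence bit strings counts the shared elements relative to all elements used by either catalyst. In the sketch below the two example bit strings are hypothetical (the slide's actual strings did not survive extraction); they are chosen so that each catalyst contains 3 elements and 2 are shared, as in the worked value of 0.5.

```python
ELEMENT_ORDER = ["V", "Mg", "B", "Mo", "La", "Mn", "Fe", "Ga"]

def tanimoto(bits_a, bits_b):
    """Tanimoto index N(A&B) / (N(A) + N(B) - N(A&B)) of two bit strings."""
    n_a, n_b = sum(bits_a), sum(bits_b)
    n_common = sum(a & b for a, b in zip(bits_a, bits_b))
    union = n_a + n_b - n_common
    # Two empty catalysts are treated as identical (assumed convention).
    return n_common / union if union else 1.0

catalyst_a = [1, 1, 0, 1, 0, 0, 0, 0]   # hypothetical: V, Mg, Mo
catalyst_b = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical: V, Mg, Mn
similarity = tanimoto(catalyst_a, catalyst_b)
```

A new candidate can then be accepted or discarded by comparing its similarity to already tested catalysts against a limit that changes over the course of the search.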

OptiCat: Tabu Search

- Example
  - Tabu search on the Baerns benchmark

Tabu optimization results

- 80 iterations with 5 catalysts / population (400 catalysts)
- Optimization evolution
- Reliability?

[Figure: yield (%) versus iteration number (0 to 90); summary over 50 runs and nine panels labelled Var5: 1 to Var5: 9]

Monitoring library diversity in GA (sharing)

- To increase the browsing / exploitation ratio
- To enable finding many local maxima
- Reliability criterion: % of runs with Y > 7.2% (95% of the global maximum)

Algorithm Contest on Reliability

  Algorithm                  Iterations   Population size
  Simulated Annealing (SA)   400          1
  Tabu                       80           5
  Genetic Algorithms         10           40

- Reliability criterion: % of runs with Y > 7.2% (95% of the global maximum)

[Figure: probability of Y > 7.2 (0 to 100) for GA, Tabu, SA (lin), SA (1/t), SA (1/ln), GA + diversity management and Random. Bar values as extracted: 12, 28, 48, 58, 61, 68, 82; their assignment to the algorithms is not recoverable from the extracted text]

[Figure: yield (%) versus iterations (1 to 10), annotated 7.30 +/- 0.2]

Combining Genetic Algorithm and Data Mining

Concept of hybrid algorithm

Concept of pre-screening by data mining as iterative screening proceeds

[Diagram elements:]
- Real population of K catalysts
- Evaluation
- GA: engine for virtual catalysts design
- Population of µK catalysts
- In silico evaluation
- Selection of K catalysts by ANN
- Virtual end criteria
- Final population of K catalysts

Implementation in OptiCat

[Flowchart elements:]
- Initialization: build a random population of virtual catalysts
- Estimation: use the statistical model for estimating the performance of catalysts
- Criterion: number of populations
  - false: tournament selection, multipoint crossover, bit-flip mutation, new population of virtual catalysts
  - true: best catalyst
- Evolution control: synthesize and test promising catalysts

Software compatibilities

- Matlab R12
- Statistica 6.1, English version
  - English regional settings & dates