OptiCat Multivariate Optimization Platform
David Farrusseng Institut de Recherches sur la Catalyse, CNRS Villeurbanne, France
[email protected]
Objectives
- Tutorial on Evolutionary Algorithms
  - In the frame of catalyst library design
  - From chemists to chemists
- Training on OptiCat
  - To design a new library from the results of the last library
  - Encoding issue (description of variables)
  - To validate algorithm settings using simulation mode
- Questions?
Legal issues
- OptiCat can be downloaded at http://a.farrusseng.free.fr
- Open source software: CeCILL license
- CNRS is the owner of the code
- PhD student: Frédéric Clerc
  - Computer science supervisor: Rico Rakotomalala (ERIC laboratory)
  - Chemoinformatics supervisor: David Farrusseng (IRC-CNRS)
Principles of Evolutionary Algorithm
Rational Library Design
- Goals
  - To find the global optimum with a minimum of experiments
  - To find local optima
- Requirements
  - With high confidence
  - Iterative process
  - Handle library series
- Methods
  - Simulated Annealing
  - Tabu
  - Evolutionary Strategy
  - Genetic Algorithm
[Figure: yield as a function of variable 1 and variable 2]
Evolutionary Steps
- Encoding
  - Variables: 7% Pd; 11% Pt; 82% TiO2
  - Chromosome: 0 0 0 7 0 11 0 0 0 0 82 0
[Diagram labels: Data Processing, Population, Evaluation, Selection, Mating, Cross-over, Mutation]
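The encoding step above can be sketched in a few lines of Python. This is only an illustration: the slide gives the values for Pd, Pt and TiO2 and their positions in the chromosome, while the other component names (`c1`, `c2`, ...) are placeholders invented here.

```python
# Hypothetical ordered list of the 12 library components; only Pd, Pt and TiO2
# come from the slide, the other names are placeholders.
COMPONENTS = ["c1", "c2", "c3", "Pd", "c5", "Pt",
              "c7", "c8", "c9", "c10", "TiO2", "c12"]

def encode(composition):
    """Return one gene per component (wt%), with 0 for absent components."""
    unknown = set(composition) - set(COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return [composition.get(name, 0) for name in COMPONENTS]

chromosome = encode({"Pd": 7, "Pt": 11, "TiO2": 82})
print(chromosome)
```

With this ordering the result is the chromosome shown on the slide.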
How do EAs perform?
- Evolution of the population in the search space
- Evolution of the population performances
[Figures: population in the search space (X1 to Xn); performance (0 to 8) versus generation number (0 to 50)]
Case study 1
Variable description
The optimal composition of a catalytic material [for the ODHP] is to be found from 8 potential components: V, Mg, B, Mo, La, Mn, Fe and Ga. The sample can be synthesized by 2 preparation methods.
Example chromosome: 0.12 0.23 0.65 0.02 0.38 0.19 0.55 0.83 0
Evolution of the performances (Y)
- 1st generation results
- All 4 generations
[Figure: S(C3H6) / % (0 to 100) versus X(C3H8) / % (0 to 20) for the 1st, 2nd, 3rd and 4th generations, with the objective function selection]
Evolution in the search space (X)
- 1st generation
- 4th generation
Benchmark definition
The optimal composition of a catalytic material [for the ODHP] is to be found from 8 potential components: V, Mg, B, Mo, La, Mn, Fe and Ga. Usually, a relationship between the composition and catalytic performance is unknown before screening. For illustration, a quantitative hypothetical relationship between yield and composition of the catalytic material has been set up to simulate problems of real catalyst optimization including phenomena such as synergism of components, poisoning and influence of the preparation method (Eq. (4)) and to estimate the optimal control parameters of our evolutionary concept. The objective function (OF) to be maximized (catalyst performance P expressed by yield, which is the product of selectivity S and conversion X) is described as follows:

\[
\mathrm{Yield} =
\begin{cases}
X_1 \cdot S_1 & \text{if impregnation} \\
X_2 \cdot S_2 & \text{if coprecipitation} \\
0 & \text{if La or B are present}
\end{cases}
\]

\[
\begin{aligned}
S_1 &= 66\,x_{V}\,x_{Mg}\,(1 - x_{V} - x_{Mg}) + 2\,x_{Mo} - 0.1\,x_{Mn} - 0.1\,x_{Fe} \\
X_1 &= 66\,x_{V}\,x_{Mg}\,(1 - x_{V} - x_{Mg}) - 0.1\,x_{Mo} + 1.5\,x_{Mn} + 1.5\,x_{Fe} \\
S_2 &= 60\,x_{V}\,x_{Mg}\,(1 - 1.3\,x_{V} - x_{Mg}) \\
X_2 &= 60\,x_{V}\,x_{Mg}\,(1 - 1.3\,x_{V} - x_{Mg})
\end{aligned}
\]
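The benchmark above translates directly into a small function, which is convenient for simulation runs. The sketch below is an illustration, not the OptiCat VBA macro: the function name, the dictionary input and the choice to let the "La or B present" rule take precedence over the preparation method are assumptions.

```python
def benchmark_yield(x, method):
    """Hypothetical yield for fractions x (dict keyed by element) and a method."""
    # Assumption: poisoning by La or B overrides the preparation method.
    if x.get("La", 0.0) > 0 or x.get("B", 0.0) > 0:
        return 0.0
    v, mg = x.get("V", 0.0), x.get("Mg", 0.0)
    mo, mn, fe = x.get("Mo", 0.0), x.get("Mn", 0.0), x.get("Fe", 0.0)
    if method == "impregnation":
        base = 66 * v * mg * (1 - v - mg)
        selectivity = base + 2 * mo - 0.1 * mn - 0.1 * fe
        conversion = base - 0.1 * mo + 1.5 * mn + 1.5 * fe
    elif method == "coprecipitation":
        selectivity = conversion = 60 * v * mg * (1 - 1.3 * v - mg)
    else:
        raise ValueError(f"unknown preparation method: {method}")
    return conversion * selectivity

best = benchmark_yield({"V": 0.32, "Mg": 0.32, "Mo": 0.26, "Mn": 0.10},
                       "impregnation")
poisoned = benchmark_yield({"V": 0.32, "Mg": 0.32, "La": 0.10}, "impregnation")
print(round(best, 2), poisoned)
```

Evaluated at the rounded optimum composition quoted for this benchmark, the function returns about 7.5, close to the stated global maximum of 7.55%.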
Benchmark description
- Search space
  - 8 continuous variables (wt%): V, Mg, B, Mo, La, Mn, Fe, Ga
  - 1 categorical variable: preparation method (coprecipitation/impregnation)
- Non-linear function as surface response
- Global maximum
  - Yield = 7.55%
  - xV=0.32, xMg=0.32, xMo=0.26 and xMn+xFe=0.10 for impregnation
  - 12% probability to find Y>7.2%
[Figure: response surface; labels: V, Mg, La or B]
Evolutionary Strategy (1)
- Encoding for Evolutionary Strategy
  - Chromosome design is saved as a txt file with .oct as extension
  - ES_AppliedCata_coding.exe
- Evolutionary Strategy Model
  - Algorithm design is saved as a txt file with .opt as extension
  - Real mode Design (Manual Evaluation): 2a-ES_AppliedCata_Model_RealMode
  - Real mode Run: 2b-ES_AppliedCata_Model_RealMode-RUN
  - Simulation mode (with Excel VBA macro): 2c- ES_AppliedCata_Model_SimulationMode
- Not efficient encoding!
Evolutionary Strategy (2)
Not efficient encoding!
- Crossover in Evolutionary Strategy
    Parent 1:  0.12  0.23  0.65  0.02  0.38  0.19  0.55  0.83  0
    Parent 2:  0.43  0.04  0.71  0.34  0.22  0.65  0.72  0.05  1
    Child 1:   0.12  0.23  0.65  0.02  0.38  0.65  0.72  0.05  1
    Child 2:   0.43  0.04  0.71  0.34  0.22  0.19  0.55  0.83  0
  Impossible to set a variable to 0
- Mutation in Evolutionary Strategy
    Before:    0.43  0.04  0.71  0.34  0.22  0.19  0.55  0.83  0
    After:     0.43  0.04  0.71  0.98  0.22  0.19  0.55  0.83  0
  Hardly possible to set a variable to 0
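The crossover and mutation examples on this slide can be reproduced with a short sketch. The function names and the choice of a uniform random draw for the mutated gene are assumptions made for illustration; OptiCat's own operators may differ.

```python
import random

def one_point_crossover(parent1, parent2, cut):
    """Swap the tails of two chromosomes after position `cut`."""
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2

def mutate_gene(chromosome, index, rng):
    """Replace one gene by a new random value, rounded to two decimals."""
    mutated = list(chromosome)
    mutated[index] = round(rng.random(), 2)
    return mutated

parent1 = [0.12, 0.23, 0.65, 0.02, 0.38, 0.19, 0.55, 0.83, 0]
parent2 = [0.43, 0.04, 0.71, 0.34, 0.22, 0.65, 0.72, 0.05, 1]
child1, child2 = one_point_crossover(parent1, parent2, cut=5)
mutant = mutate_gene(child2, index=3, rng=random.Random(0))
print(child1)
print(child2)
print(mutant)
```

Crossover only recombines values already present in the parents, and a random draw almost never returns exactly 0, which is why this encoding struggles to remove an element from a catalyst.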
Evolutionary Strategy (3)
- Presence/absence encoding for Evolutionary Strategy (2d-ES_coding)

    Element            V     Mg    B      Mo    La    Mn    Fe     Ga
    Presence gene      0     1     1      0     0     1     1      1
    Amount gene        0.12  0.23  0.65   0.02  0.38  0.19  0.55   0.83
    Presence x amount  0     0.23  0.65   0     0     0.19  0.55   0.83
    Composition        0%    9.4%  26.5%  0%    0%    7.7%  22.4%  33.9%

    Preparation method gene: 0

  Catalyst composition shall be calculated
- Evolutionary Strategy Model with presence/absence
  - Simulation mode (with Excel VBA macro): 2e-Baerns_ES_simulation_Excel
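The composition calculation required by the presence/absence encoding can be sketched as follows: each amount gene is multiplied by its presence gene and the result is normalized to 100%. The function name and the handling of an all-absent chromosome are assumptions.

```python
ELEMENTS = ["V", "Mg", "B", "Mo", "La", "Mn", "Fe", "Ga"]

def composition(presence, amount):
    """Return the composition in % from presence (0/1) and amount genes."""
    active = [p * a for p, a in zip(presence, amount)]
    total = sum(active)
    if total == 0:
        # Assumption: a chromosome with no element present gives an empty catalyst.
        return [0.0] * len(active)
    return [100 * a / total for a in active]

presence = [0, 1, 1, 0, 0, 1, 1, 1]
amount = [0.12, 0.23, 0.65, 0.02, 0.38, 0.19, 0.55, 0.83]
percent = composition(presence, amount)
for element, value in zip(ELEMENTS, percent):
    print(f"{element}: {value:.1f}%")
```

The results agree with the percentages on the slide to within 0.1 point. An element is removed simply by flipping its presence gene, which the amount-only encoding could hardly achieve.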
Genetic Algorithm
- Presence/absence encoding for Genetic Algorithm: 3-Baerns_GA_coding
- Genetic Algorithm Model
  - Simulation mode (with Excel VBA macro): 4-Baerns_GA_simulation_Excel
Batch Mode for statistical assessment
- Automated Evaluation: Simulation with Benchmarks
  - VBA Excel
  - MatLab dll
- Batch mode for multiple runs settings
  - Statistical assessment of algorithms
Baerns: Results of the virtual screening
- Effect of the population size: 25, 50, 100
OptiCat: Results of the virtual screening
Evolutionary Strategy: statistics on 25 runs
- Non optimized parameters
- Dynamic mutation and crossover rates
[Figure: three panels for population sizes 25, 50 and 100; y axis 0 to 8, x axis (Var4) 1 to 28]
Strong effect of the population size on the algorithm reliability
OptiCat: Results of the virtual screening
Genetic Algorithm: statistics on 25 runs
- Constant mutation and crossover rates
- Dynamic mutation and crossover rates
[Figure: panels for population sizes 25, 50 and 100 under each rate setting; y axis 0 to 8, x axis (Var4) 1 to 28]
Algorithm design
Algorithm design
- Good balance between:
  - browsing / exploitation
  - diversity / elitism
- Algorithm design is case dependent
  - Shape of the surface response
  - Noise, outliers
- Generic vs specific algorithms
  - Evolutionary or Genetic Algorithm
  - Chromosome design
  - Selection types & parameters
  - Cross-over, mutation type & parameters
  - Population size (fixed by facilities)
  - Iteration number (cost)
Selection for monitoring the selective pressure
- Threshold
- Ranking
- Wheel
- Tournament
- Options: Elitism
[Figure: the four selection schemes illustrated on individuals A to H]
Selection options
- Dataset on which the selection is carried out
  - Last population
  - All accumulated populations
  - Outliers!
- Elitism
  - n best individuals are selected
- Size of population
  - Must be set at “same” (unless meta-modelling is used)
Threshold
- 0 < Threshold parameter < 1

    Individual  A   B   C   D   E   F   G   H
    Fitness     25  35  50  10  15  7   95  5

- Threshold parameter = 0.4
  - Number of individuals in the population x Threshold parameter = 8 x 0.4 = 3.2
  - The 3 best individuals are selected (G, C, B)
[Figure: individuals ranked by fitness (G, C, B, A, E, D, F, H) with the 40% cut after the third]
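Threshold selection on this example population can be sketched as below. Truncating 3.2 to 3 follows the slide; the function name is an assumption. Ranked by fitness, the three fittest individuals are G (95), C (50) and B (35).

```python
FITNESS = {"A": 25, "B": 35, "C": 50, "D": 10, "E": 15, "F": 7, "G": 95, "H": 5}

def threshold_selection(fitness, threshold):
    """Keep the best fraction `threshold` of the population (truncated)."""
    n_keep = int(len(fitness) * threshold)   # 8 x 0.4 = 3.2 -> 3
    ranked = sorted(fitness, key=fitness.get, reverse=True)
    return ranked[:n_keep]

selected = threshold_selection(FITNESS, 0.4)
print(selected)
```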
Wheel
- 0 < selective pressure < 2 (?)
  - 1 is the generally used setting for GA
  - The higher, the more elitist
- Outliers!

    Individual   A    B    C    D   E   F   G    H
    Fitness      25   35   50   10  15  7   95   5
    Probability  10%  14%  21%  4%  6%  3%  39%  2%

[Figure: wheel with one sector per individual]
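The wheel probabilities in the table are each individual's fitness divided by the total fitness (242). The sketch below reproduces them and draws parents accordingly. It leaves out the selective pressure parameter, whose exact definition is not given on the slide, and the function names are assumptions.

```python
import random

FITNESS = {"A": 25, "B": 35, "C": 50, "D": 10, "E": 15, "F": 7, "G": 95, "H": 5}

def wheel_probabilities(fitness):
    """Selection probability proportional to fitness."""
    total = sum(fitness.values())
    return {name: value / total for name, value in fitness.items()}

def wheel_select(fitness, n, rng):
    """Draw n parents with replacement, proportionally to fitness."""
    names = list(fitness)
    weights = [fitness[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

probabilities = wheel_probabilities(FITNESS)
percent = {name: round(100 * p) for name, p in probabilities.items()}
parents = wheel_select(FITNESS, n=4, rng=random.Random(0))
print(percent)
print(parents)
```

A single outlier such as G takes about 39% of the wheel, which illustrates why this scheme is sensitive to outliers.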
Ranking
- 0 < Selective pressure < 2 (?)
  - 1 is the generally used setting for GA
  - The higher, the more elitist

    Individual   A    B    C    D   E    F   G    H
    Fitness      25   35   50   10  15   7   95   5
    Rank         5    6    7    3   4    2   8    1
    Probability  14%  17%  19%  8%  11%  6%  22%  3%

[Figure: wheel with one sector per individual, sized by rank]
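The ranking probabilities in the table equal each rank divided by the sum of the ranks (36). The sketch below reproduces them; as with the wheel, the selective pressure parameter is not modelled and the function name is an assumption.

```python
FITNESS = {"A": 25, "B": 35, "C": 50, "D": 10, "E": 15, "F": 7, "G": 95, "H": 5}

def rank_probabilities(fitness):
    """Rank individuals (1 = worst) and select proportionally to rank."""
    ordered = sorted(fitness, key=fitness.get)            # worst first
    ranks = {name: position + 1 for position, name in enumerate(ordered)}
    total = sum(ranks.values())
    probabilities = {name: rank / total for name, rank in ranks.items()}
    return ranks, probabilities

ranks, probabilities = rank_probabilities(FITNESS)
percent = {name: round(100 * probabilities[name]) for name in FITNESS}
print(ranks)
print(percent)
```

Because only the order matters, the outlier G receives 22% instead of the 39% it gets on the fitness-proportional wheel.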
Tournament
- 0 < number of individuals < population size
  - 2 to 3 are the generally used settings for ES
  - The higher, the more elitist

    Individual  A   B   C   D   E   F   G   H
    Fitness     25  35  50  10  15  7   95  5

- Example with 3 individuals in the tournament
  - D, F, A -> A
  - B, H, D -> B
  - H, F, D -> D
  - …
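A tournament picks a few individuals at random and keeps the fittest of them. The sketch below reproduces the three tournaments of the slide; the function names and the sampling without replacement inside a tournament are assumptions.

```python
import random

FITNESS = {"A": 25, "B": 35, "C": 50, "D": 10, "E": 15, "F": 7, "G": 95, "H": 5}

def tournament_winner(fitness, contestants):
    """Return the fittest of the given contestants."""
    return max(contestants, key=fitness.get)

def tournament_selection(fitness, size, n, rng):
    """Run n tournaments of `size` randomly drawn individuals."""
    names = list(fitness)
    return [tournament_winner(fitness, rng.sample(names, size)) for _ in range(n)]

winners = [tournament_winner(FITNESS, t)
           for t in (["D", "F", "A"], ["B", "H", "D"], ["H", "F", "D"])]
parents = tournament_selection(FITNESS, size=3, n=5, rng=random.Random(0))
print(winners)
print(parents)
```

With a tournament size equal to the population size, the best individual wins every time, which is the elitist limit mentioned above.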
Selection types
From elitism to diversity: Tournament with 12 (population size = 25), Threshold, Wheel (outliers), Rank, Tournament with 2
- The higher the diversity:
  - The higher the browsing / exploitation ratio
  - The greater the chance to reach the global maximum
  - The lower the convergence speed
- The higher the elitism:
  - The lower the browsing / exploitation ratio
  - The greater the chance to get trapped at a local maximum
  - The higher the convergence speed
Crossover types
- 1-point
- Multipoints
- Uniform
- Options
  - Fixed crossover probabilities: 0.6 to 0.8 are the generally used settings for ES
  - Dynamic crossover probabilities: 1-(Pmean / Pbest) is used in Baerns ES
Uniform crossover
- 0 < portion of chromosome exchanged < 1
  - 0.4 to 0.6 are the generally used settings for ES
  - The higher, the more diversity
- Can operate on
  - Encoded genes (GA)
  - Un-encoded genes (ES)
- Example with value = 0.5
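A minimal sketch of uniform crossover, assuming that "portion of chromosome exchanged" means the probability that each gene is swapped between the two parents (the slide does not define it more precisely):

```python
import random

def uniform_crossover(parent1, parent2, portion, rng):
    """Swap each gene between the parents with probability `portion`."""
    child1, child2 = list(parent1), list(parent2)
    for i in range(len(child1)):
        if rng.random() < portion:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2

parent1 = [7, 7, 7, 7, 7, 7, 7, 7]
parent2 = [0, 0, 0, 0, 0, 0, 0, 0]
child1, child2 = uniform_crossover(parent1, parent2, portion=0.5,
                                   rng=random.Random(0))
print(child1)
print(child2)
```

The same function works on un-encoded genes (ES) and on individual bits (GA), since it only swaps list positions.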
Multipoint crossover
- 0 < number of points < number of genes
  - 1 to 3 are the generally used settings for GA
  - The higher, the more diversity
- Can operate on
  - Encoded genes (GA)
  - Un-encoded genes (ES)
- Example with value = 2
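Multipoint crossover alternates between the two parents at each cut point. The sketch below uses two cut points, as in the example above; the cut positions and the function name are chosen here for illustration.

```python
def multipoint_crossover(parent1, parent2, cuts):
    """Build two children by switching parents at every cut position."""
    child1, child2 = [], []
    swap = False
    start = 0
    for cut in sorted(cuts) + [len(parent1)]:
        first, second = (parent2, parent1) if swap else (parent1, parent2)
        child1 += first[start:cut]
        child2 += second[start:cut]
        swap = not swap
        start = cut
    return child1, child2

parent1 = [7, 7, 7, 7, 7, 7, 7, 7]
parent2 = [0, 0, 0, 0, 0, 0, 0, 0]
child1, child2 = multipoint_crossover(parent1, parent2, cuts=[2, 5])
print(child1)   # [7, 7, 0, 0, 0, 7, 7, 7]
print(child2)   # [0, 0, 7, 7, 7, 0, 0, 0]
```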
Genetic Algorithm vs. Evolutionary Strategy
- Crossover in Evolutionary Strategy
  - Variable values are exchanged
- Crossover in Genetic Algorithm
  - Variable values are exchanged
  - Values are “mutated” at the cross-over points
- Mutation rate is important in ES
[Figure: crossover of the parents 7 7 7 7 and 0 0 0 0; ES children contain only the values 7 and 0, GA children also contain new values (e.g. 7 6 0 0 and 0 1 7 7)]
Mutation types
- Bit-flip
- ES
- Options
  - Fixed mutation probabilities: 0.01 to 0.05 are the generally used settings for GA
  - Dynamic mutation probabilities: (Pmean / Pbest) is used in Baerns ES
Bit Flip mutation
- 0 < mutation rate < 1
  - The higher, the more diverse
  - The higher the browsing / exploitation ratio
  - The lower the convergence speed
- Changes a bit in encoded chromosomes (GA)
- Can operate on un-encoded chromosome (ES)
  - Changes a variable value
[Figure: example chromosome 0 7 2 1 mutated to 0 7 7 1]
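On an encoded (binary) chromosome, bit-flip mutation inverts each bit independently with a probability equal to the mutation rate. A minimal sketch, with an illustrative function name and bit string:

```python
import random

def bit_flip(bits, rate, rng):
    """Invert each bit with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]

chromosome = [0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1]
mutant = bit_flip(chromosome, rate=0.05, rng=random.Random(0))
print(mutant)
```

With the 0.01 to 0.05 rates usual for GA, most children are left unchanged, and a higher rate increases diversity at the cost of convergence speed.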
ES mutation
- 0 < mutation rate < 1
  - The higher, the more diverse
  - The higher the browsing / exploitation ratio
  - The lower the convergence speed
- Changes a gene in un-encoded chromosomes (ES)
[Figure: example chromosome 0 7 2 1 mutated to 0 7 7 1]
Impact of noisy experimental data
- Two algorithms
  - elitist (tournament with 4, the best always selected)
  - normal (tournament with 2)
- Benchmark
  - Baerns
  - Baerns modified with 5% of exp. errors and 1% of false positives
- Assessment criteria
  - Robustness: % of runs which reach 90% of the global maximum at the 10th generation
  - Convergence speed: mean of the best catalysts at the 5th generation
  - Value of the Conf. Interval at generation #5
How to add noise & outliers?
- Modification of a mathematical function in VBA
  - To add noise on the published benchmark
  - Baerns example
- Evaluation node with dll
  - Noise level: typical range 0.01-0.1
  - Outlier rate: typical range 0.01-0.05, for either false positive or false negative
[Screenshot: evaluation node settings showing the values 0.01, 0.05 and 10]
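Noise and outliers can be added to any benchmark value by wrapping the evaluation. The sketch below is one possible model, not the OptiCat dll or the VBA macro: Gaussian noise proportional to the true value, and false positives that return a fixed high value (`outlier_value`, a hypothetical parameter).

```python
import random

def noisy_evaluation(true_value, noise, outlier_rate, rng, outlier_value=10.0):
    """Return a measured value with relative noise and occasional false positives."""
    if rng.random() < outlier_rate:
        return outlier_value                      # assumed false positive
    return true_value * (1 + rng.gauss(0, noise))

rng = random.Random(42)
# 5% experimental error and 1% false positives, as in the modified benchmark.
measurements = [noisy_evaluation(5.0, 0.05, 0.01, rng) for _ in range(10)]
print([round(m, 2) for m in measurements])
```

A false negative could be modelled the same way by returning a value close to zero.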
Results (1)
- Normal selection versus elitism selection
- No noise versus noise
[Figure: mean plots with whiskers (Mean±0.95 Conf. Interval) for normal and elitist selection, without and with noise; y axis 0 to 8, x axis (Var4) 1 to 28]
Results (2)
- 50 runs
- No noise, no outliers
- 5% noise, 1% outliers
- 10% noise, 5% outliers
[Figure: Yield (%) (0 to 8) versus generation number (1 to 25); panels 9S, 9N, 9vN, 11S, 11N, 11vN, 16S, 16N, 16vN]
Effects of GA parameters on the optimization speed
- Benchmarks: Smooth, Noisy, Very noisy
- Parameters

    Elitism      yes, no
    Pop. size    24, 40, 56
    Cross-over   uniform, 1-point, 3-point
    Selection    ranking, threshold, wheel, tournament
CO oxidation case study
Case study description
- Supported bimetallic systems
  - 1 Noble Metal (NM) among Pt, Au, Cu; loading: 0.1-2.1wt%
  - 1 Transition Metal (TM) among Mo, Nb, V; loading: 1-5wt%
  - 1 support among CeO2, TiO2, ZrO2
- Testing conditions
  - 3 temperatures: 200, 250, 300°C
[Figure: CO conversion versus composition for each ternary (Mo, Nb, V; Au, Cu, Pt; TiO2, ZrO2, CeO2)]
Chromosome encoding
- Evolutionary Strategy
- Genetic Algorithm
  In the present case, the string is composed of 24 bits. The 4 discrete variables (Temperature, Support, TM and NM types) were encoded in genes of 4 bits each. As indicated previously, they can take 3 different modalities: Cu, Au and Pt for NM; V, Mo and Nb for TM; Ce, Ti, Zr for Support; and T1, T2 and T3 for the tested temperature. The 2 continuous composition variables were also encoded as genes of 4 bits, resulting in 16 steps of 0.06% and 0.25% for the NM (0.1-1.1%) and TM (1-5%) concentrations, respectively. Therefore a subspace defined by a ternary at a given temperature encompasses 16^2 = 256 catalysts and the whole search space more than 20,000 experiments (3^4 x 16^2 = 20,736).
[Figure: string representation of the 1%Cu, 3%V, Zr, 200°C catalyst solution]
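The arithmetic behind this encoding (step sizes and search space size) can be checked with a few lines. The variable names are illustrative; the numbers are those given in the text.

```python
BITS_PER_GENE = 4
N_GENES = 6                      # 4 discrete + 2 continuous variables
LEVELS = 2 ** BITS_PER_GENE      # 16 steps per continuous gene

string_length = BITS_PER_GENE * N_GENES
nm_step = (1.1 - 0.1) / LEVELS   # noble metal loading step, wt%
tm_step = (5.0 - 1.0) / LEVELS   # transition metal loading step, wt%

subspace = LEVELS ** 2           # one ternary at one temperature
search_space = 3 ** 4 * subspace # 3 modalities for each of the 4 discrete variables

print(string_length, round(nm_step, 4), tm_step, subspace, search_space)
```

The noble metal step of 0.0625% is the value rounded to 0.06% in the text.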
Objective function
- Single objective = easy
  - Yield is the objective function
  - Y = Conversion x Selectivity
  - Conversion & selectivity have the same weight and are linear
- Concept of desirability
  - Multi-objective
  - To build a unique indicator of the performance
- Desirability functions
  - DConv = 0.0105 x X(CO) if X(CO) < 95% and DConv = 1 if X(CO) ≥ 95%
  - DSel = 0.001 x S(CO2) + 0.9
  - DTemp = -0.002 x T + 1.4 (T in °C)
  - Y = DConv x DSel x DTemp
[Figure: desirability d (0 to 1) versus conversion (%), selectivity (%) and temperature (°C)]
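The desirability functions combine into a single objective as sketched below. One assumption is made: DTemp is taken as a function of the test temperature T in °C, as the temperature axis of the plot suggests. The function names are illustrative.

```python
def d_conv(x_co):
    """Desirability of the CO conversion (%)."""
    return 1.0 if x_co >= 95 else 0.0105 * x_co

def d_sel(s_co2):
    """Desirability of the CO2 selectivity (%)."""
    return 0.001 * s_co2 + 0.9

def d_temp(temperature):
    """Desirability of the test temperature (°C); assumed variable, see above."""
    return -0.002 * temperature + 1.4

def desirability(x_co, s_co2, temperature):
    """Overall objective Y = DConv x DSel x DTemp."""
    return d_conv(x_co) * d_sel(s_co2) * d_temp(temperature)

print(round(desirability(95, 100, 200), 3))
print(round(desirability(50, 50, 250), 3))
```

With these coefficients the conversion dominates the objective, while selectivity and temperature each change Y by at most 10% and 20% over their ranges.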
Optimization of Solid Catalyst Using Diversity
Two different diversities
- Diversity based on library fitness (Y)
  - Diversity indicator: Y(mean)/Y(best), std, …
  - Monitoring of crossover & mutation rates
  - When Y(mean)/Y(best) → 1, the algorithm is converging and the mutation rate shall increase to escape from a local maximum
- Diversity based on catalyst description (X)
  - Diversity indicator: to set rules to compute distance between catalysts
  - To discard/allow new catalysts according to distances
  - Example: Sharing, Tabu, simulated annealing
Baerns: Diversity based on library fitness (Y)
- Monitoring crossover & mutation rates
  - Crossover: 1-(Pmean / Pbest) is used in Baerns ES
  - Mutation: (Pmean / Pbest) is used in Baerns ES
Effect of dynamic parameters in ES
- Crossover = 1-(Pmean / Pbest), Mutation = Pmean / Pbest
- Crossover = 1-(0.25*Pmean / Pbest), Mutation = 0.25*Pmean / Pbest
[Figure: mean plots with whiskers (Mean±0.95 Conf. Interval) for both settings: performance (0 to 8) and crossover and mutation rates (0 to 1) versus Var4 (1 to 28)]
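The dynamic crossover and mutation rates follow directly from the mean and best performance of the current population. A minimal sketch, where `scale` stands for the 0.25 factor of the second setting and the function name is an assumption:

```python
def dynamic_rates(performances, scale=1.0):
    """Return (crossover, mutation) rates from the population performances."""
    p_mean = sum(performances) / len(performances)
    p_best = max(performances)       # assumed strictly positive
    mutation = scale * p_mean / p_best
    crossover = 1 - mutation
    return crossover, mutation

population = [25, 35, 50, 10, 15, 7, 95, 5]
crossover, mutation = dynamic_rates(population)
crossover_025, mutation_025 = dynamic_rates(population, scale=0.25)
print(round(crossover, 3), round(mutation, 3))
print(round(crossover_025, 3), round(mutation_025, 3))
```

As the population converges, Pmean approaches Pbest, so the mutation rate rises and the crossover rate falls.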
Two different diversities
- Diversity based on library fitness (Y)
  - Diversity indicator: Y(mean)/Y(best), …
  - Monitoring of crossover & mutation rates
  - When Y(mean)/Y(best) → 1, the algorithm is converging and the mutation rate shall increase to escape from a local maximum
- Diversity based on catalyst description (X)
  - Diversity indicator: to set rules to compute distance between catalysts
  - Assuming that similar catalysts have similar performances
  - To discard/allow new catalysts according to distances
  - Example: Sharing, Tabu, simulated annealing
How does Tabu perform?
- Evolution of the population in the search space
- Browsing: random sample generation
- Optimization: selection of the best sample, decreasing population diversity
Diversity management of populations
Tanimoto index for bit string descriptors
- Presence/absence of elements more significant than variation of wt%
- Bit string encoding of catalysts A and B, one bit per element (V, Mg, B, Mo, La, Mn, Fe, Ga)

\[
T(A,B) = \frac{N_{A\&B}}{N_A + N_B - N_{A\&B}} = \frac{2}{3 + 3 - 2} = 0.5
\]
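The Tanimoto index on presence/absence bit strings can be computed as below. The two bit strings are hypothetical (the original ones are not legible in the slide) and were chosen so that each catalyst contains 3 elements, 2 of them shared, as in the worked example.

```python
ELEMENTS = ["V", "Mg", "B", "Mo", "La", "Mn", "Fe", "Ga"]

def tanimoto(a, b):
    """Tanimoto index N_A&B / (N_A + N_B - N_A&B) of two bit strings."""
    n_a, n_b = sum(a), sum(b)
    n_ab = sum(x & y for x, y in zip(a, b))
    union = n_a + n_b - n_ab
    # Assumption: two empty catalysts are considered identical.
    return n_ab / union if union else 1.0

catalyst_a = [1, 1, 0, 1, 0, 0, 0, 0]   # hypothetical: V, Mg, Mo
catalyst_b = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical: V, Mg, Mn
similarity = tanimoto(catalyst_a, catalyst_b)
print(similarity)   # 0.5
```

A new catalyst can then be discarded or allowed by comparing this similarity with a threshold.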
- Monitoring the rate of diversity
[Figure: Tanimoto index (0 to 1) versus sample number (0 to 400) for three schedules: 1/ln, 1/t and lin]
OptiCat: Tabu Search
- Example: Tabu search on Baerns benchmark
Tabu optimization results
- 80 iterations with 5 catalysts / population (400 catalysts)
- 50 runs
- Optimization evolution
- Reliability?
[Figure: Yield (%) versus iteration number (0 to 80); overall plot and panels labelled Var5: 1 to Var5: 9]
Monitoring library diversity in GA (sharing)
- To increase the browsing / exploitation ratio
- To make it possible to find many local maxima
- Reliability criteria: % of runs with Y>7.2% (95% of the global maximum)
Algorithm Contest on Reliability

    Algorithm                 Iterations  Population size
    Simulated Annealing (SA)  400         1
    Tabu                      80          5
    Genetic Algorithms        10          40

- Reliability criteria: % of runs with Y>7.2% (95% of the global maximum)
[Figure: bar chart of the probability of Y>7.2% (0 to 100) for Random, SA (lin), SA (1/t), SA (1/ln), Tabu, GA and GA + diversity management; bar values 12, 28, 48, 61, 58, 68, 82 (assignment to algorithms not shown)]
[Figure: Yield (%) (0 to 8) versus iterations (1 to 10), annotated 7.30 +/-0.2]
Combining Genetic Algorithm and Data Mining
Concept of hybrid algorithm
- Concept of pre-screening by data mining as iterative screening proceeds
[Diagram labels: Real population of K catalysts; Evaluation; GA (engine for virtual catalysts design); population of µK catalysts; In silico evaluation; selection of K catalysts by ANN; Virtual end criteria; Final population of K catalysts]
Implementation in OptiCat
- Initialization: build random population of virtual catalysts
- Tournament selection
- Multi point crossover
- Bit-flip mutation
- New population of virtual catalysts
- Estimation: use the statistical model for estimating the performance of catalysts
- Criterion: nb of populations (true / false)
- Evolution control: synthesize and test promising catalysts
- Best catalyst
Software compatibilities
- Matlab R12
- Statistica 6.1, English version
- English regional settings & dates