Bringing the Chemical Knowledge into Empirical Models - Marion Cuny

Presentation at Chimiometrie 2009

Bringing the Chemical Knowledge into Empirical Models

Marion Cuny, CAMO Software AS, www.camo.com

Outline
1. Limits of empirical modeling
2. Methodology
   a) NIR and MIR data
   b) Cross-model validation
   c) Background information
   d) LPLS
3. Results
4. Conclusions


Limits of empirical modeling
For many years, chemometrics has focused on new, better methods for classification and regression, and on using variable selection to obtain the optimal classification or the lowest RMSEP.
Examples: iPLS, moving window iPLS, genetic algorithms, best combination search algorithms, uncertainty estimates by resampling (bootstrap, jack-knifing)…

32 marzipan samples
• 32 marzipan samples made from 9 recipes were obtained from an industrial batch production.
• Differences in recipes: processing times and different amounts of
– almonds,
– apricot kernels,
– water,
– sucrose,
– invert sugar,
– glucose syrup,
– plus minor contributions of additives in the marzipan masses.
Details about the experimental setup can be found in [1].
• Goal: online measurement of the sugar content, ranging from 35-70%, and of the water content, varying from 6-18%.

[1] J. Christensen, L. Nørgaard, H. Heimdal, J.G. Pedersen, S.B. Engelsen. Rapid Spectroscopic Analysis of Marzipan – Comparative Instrumentation. Journal of Near Infrared Spectroscopy, vol. 12 (1), 2004.

Regression results for sugar content

Type of data   Pretreatment      # LVs   RMSECV (Full)   R2 (Validation, Full CV)
NIR            No pretreatment   5       1.57            0.986
NIR            SNV               4       1.43            0.988
MIR            No pretreatment   8       2.01            0.976
MIR            SNV               8       1.77            0.982
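For reference, the two figures of merit in the table are usually defined as below. These are the standard textbook definitions, assumed here rather than taken from the slides: $\hat{y}_{(i)}$ is the prediction for sample $i$ from a model fitted without that sample, and "Full" cross-validation is read as leave-one-out. Some software reports the squared correlation between measured and predicted values as R2 instead, so the second formula is only one possible reading.

```latex
\mathrm{RMSECV} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_{(i)}\bigr)^2},
\qquad
R^2_{\mathrm{CV}} = 1 - \frac{\sum_{i=1}^{N}\bigl(y_i - \hat{y}_{(i)}\bigr)^2}{\sum_{i=1}^{N}\bigl(y_i - \bar{y}\bigr)^2}
```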

Significant B coefficients
[Figure: sugar molecule and the significant regression (B) coefficients]
All those wavelengths cannot be related to sugar!

How do you deal with very correlated variables?
PLS2 algorithm.

Goals
• To find a good model that can be explained by chemistry
• Include more of the chemistry and biology in the models:
– What is the chemistry in the system that is being observed?
– Combine the available instrumental data to get a better understanding: NIR explained by MIR.
– At the same time, apply the chemical background knowledge to confirm the interpretation from the models.


MIR and NIR spectroscopy of marzipan samples
• FT-IR (Perkin-Elmer System 2000, 1700-600 cm-1) and VIS/NIR (FOSS NIR System 6500 instrument, 400-2500 nm) spectroscopy.
[Figure: MIR spectra (left) and VIS/NIR spectra (right) of marzipan samples]
• MIR spectra are considered as containing the fundamental chemical information.
• NIR spectra contain overtones of the fundamental chemical information.

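Preprocessing
• Standard normal variate (SNV), which is equivalent to centering each spectrum and normalizing it to unit variance:

$$x_{ik}^{\mathrm{SNV}} = \frac{x_{ik} - m_i}{s_i}$$

where $m_i$ and $s_i$ are the mean and the standard deviation of spectrum $i$ over all variables $k$.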
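A minimal NumPy sketch of the SNV transform may make the formula concrete. The function name `snv` and the toy spectra are illustrative, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the slides do not say which divisor was used.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (one row each)."""
    spectra = np.asarray(spectra, dtype=float)
    m = spectra.mean(axis=1, keepdims=True)         # m_i: mean of spectrum i
    s = spectra.std(axis=1, ddof=1, keepdims=True)  # s_i: std. dev. of spectrum i (ddof=1 assumed)
    return (spectra - m) / s

# Two toy "spectra" that differ only by an offset and a scale factor
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [12.0, 14.0, 16.0, 18.0]])
X_snv = snv(X)
```

After the transform the two toy rows coincide, which is the point of SNV: additive and multiplicative differences between spectra are removed.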

Selection of variables
• PLS regression between X = MIR and Y = NIR
• Study of the significance of the regression coefficients, tested by jack-knifing
• Validation method: cross-model validation (CMV)


[Diagram: X and Y blocks; cross-validation (selected variables from significant coefficients) compared with cross-model validation (selected variables)]

Cross-model validation (CMV) with variable selection based on jack-knifing
0. Cross-validation on all objects
1. Take out e.g. 10% of the objects
2. Cross-validate the remaining objects
3. Find significant variables
4. Predict the objects that were kept out
5. Estimate RMSE (or explained variance)
6. Repeat 1 - 5 until all objects have been taken out
7. Show frequency of significance for all variables
9. Collect and predict an independent test set!
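Steps 1 to 7 above can be sketched in NumPy as follows. This is a simplified illustration, not the implementation used for the talk: it uses a single response `y` (PLS1) instead of the multi-response MIR-to-NIR model, the names `pls1_coef`, `jackknife_significant` and `cross_model_validation` are hypothetical, and `t_crit=2.0` is only a rough stand-in for a 5% significance level. Steps 0 and 9 are left out.

```python
import numpy as np

def pls1_coef(X, y, n_comp):
    """Regression coefficients of a PLS1 model (NIPALS) on centred X and y."""
    E, f = X - X.mean(axis=0), y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = E.T @ f
        w = w / np.linalg.norm(w)          # X-weights
        t = E @ w                          # scores
        p, qa = E.T @ t / (t @ t), f @ t / (t @ t)
        E, f = E - np.outer(t, p), f - qa * t   # deflation
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def jackknife_significant(X, y, n_comp, t_crit=2.0):
    """Leave-one-out jack-knife: flag coefficients with |b| > t_crit * s(b)."""
    n = len(y)
    b_full = pls1_coef(X, y, n_comp)
    b_loo = np.array([pls1_coef(np.delete(X, i, axis=0), np.delete(y, i), n_comp)
                      for i in range(n)])
    se = np.sqrt((n - 1) / n * ((b_loo - b_full) ** 2).sum(axis=0))
    return np.abs(b_full) > t_crit * se

def cross_model_validation(X, y, n_comp=2, n_segments=10, seed=0):
    rng = np.random.default_rng(seed)
    n, k = X.shape
    freq, y_pred = np.zeros(k), np.empty(n)
    for held_out in np.array_split(rng.permutation(n), n_segments):   # step 1
        train = np.setdiff1d(np.arange(n), held_out)
        Xt, yt = X[train], y[train]
        sig = jackknife_significant(Xt, yt, n_comp)                   # steps 2-3
        freq += sig
        use = sig if sig.any() else np.ones(k, dtype=bool)  # fall back to all variables
        b = pls1_coef(Xt[:, use], yt, min(n_comp, int(use.sum())))
        y_pred[held_out] = (X[held_out][:, use] - Xt[:, use].mean(axis=0)) @ b + yt.mean()  # step 4
    rmse = np.sqrt(np.mean((y - y_pred) ** 2))                        # step 5
    return rmse, freq / n_segments                                    # step 7

# Toy data: only the first two of eight variables carry information about y
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=30)
rmse, freq = cross_model_validation(X, y)
```

The key design point is that the variable selection is repeated inside every outer segment, so the held-out objects never influence which variables are kept.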


Band assignment table
• 8 group frequencies were assigned to their respective vibrational regions from an NIR chart, and an initial binary matrix was built using 1 to indicate peaks and 0 elsewhere. In addition, peaks were weighted as weak (0.5), normal (1), strong (2) or very strong (4).
• The band assignment matrix was then convolved with a Gaussian filter of size N=30 (corresponding to 60 nm when the spectral resolution is equal to 2 nm) for each group.
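The construction described above can be sketched as follows. The peak positions in `assignments` are hypothetical placeholders, not the values from the NIR chart used in the talk, and the Gaussian width (`sigma = N/6`) is an assumption because the slides only give the filter size N=30.

```python
import numpy as np

wavelengths = np.arange(400, 2500, 2)   # nm, 2 nm resolution as stated on the slide
WEIGHTS = {"weak": 0.5, "normal": 1.0, "strong": 2.0, "very strong": 4.0}

# HYPOTHETICAL peak positions (nm), for illustration only
assignments = {
    "H2O (O-H)": [(1450, "strong"), (1940, "very strong")],
    "C-H2": [(1210, "weak"), (1730, "normal")],
}

def gaussian_kernel(n_points=30, sigma=None):
    """Gaussian window of n_points samples, scaled to a peak height of 1."""
    sigma = n_points / 6.0 if sigma is None else sigma   # ASSUMPTION: width not given in the source
    x = np.arange(n_points) - (n_points - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.max()

def band_matrix(assignments, wavelengths, kernel):
    """One row per functional group: weighted peaks, then Gaussian smoothing."""
    Z = np.zeros((len(assignments), wavelengths.size))
    for row, peaks in enumerate(assignments.values()):
        for position_nm, strength in peaks:
            col = np.argmin(np.abs(wavelengths - position_nm))  # nearest wavelength
            Z[row, col] = WEIGHTS[strength]
        Z[row] = np.convolve(Z[row], kernel, mode="same")
    return Z

Z = band_matrix(assignments, wavelengths, gaussian_kernel())
```

Scaling the kernel to a peak height of 1 keeps the weak/normal/strong weights directly readable in the smoothed matrix; each peak then spans 30 points, i.e. 60 nm.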

[Figure: band assignment matrix, functional group vs. wavelength (nm), shown with the sugar molecule and its expected wavelengths]


L-shape PLS regression
• Regular two-block PLS: X-weights (used to calculate scores and loadings) can be obtained from the A first eigenvectors of:

• Three-block / L-PLS: Weights for X and Z (used to find X and Z scores and loadings) may be obtained from the A first eigenvectors of:

H. Martens, E. Anderssen, A. Flatberg, L.H. Gidskehaug, M. Høy, F. Westad, A. Thybo, M. Martens. Regressing a matrix on descriptors of both its rows and of its columns, by low-rank L-PLS Regression. Computational Statistics and Data Analysis, 48, 103-125, 2005.
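As a hedged illustration of the two statements above: in standard two-block PLS the X-weights are commonly described as eigenvectors of X'YY'X, and one formulation of L-PLS, assumed here, takes the weights for X and Z from the singular vectors of X'YZ'. The sketch below uses random toy matrices, simple centring, and a one-shot decomposition without deflation, so it should not be read as the algorithm of the cited paper. The block roles (Y central, X describing its rows, Z describing its columns) follow the Perspectives slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_x, n_y, n_z = 32, 11, 11, 8     # toy sizes only
X = rng.normal(size=(n_samples, n_x))        # descriptors of the rows of Y
Y = rng.normal(size=(n_samples, n_y))        # central block
Z = rng.normal(size=(n_z, n_y))              # descriptors of the columns of Y
A = 2                                        # number of components

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
Zc = Z - Z.mean(axis=1, keepdims=True)

# Two-block PLS: first A eigenvectors of X'YY'X
S = Xc.T @ Yc
evals, evecs = np.linalg.eigh(S @ S.T)
order = np.argsort(evals)[::-1]              # eigh returns ascending order
W_pls = evecs[:, order[:A]]

# L-PLS (assumed formulation): singular vectors of X'YZ'
M = Xc.T @ Yc @ Zc.T                         # shape (n_x, n_z)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
W_x = U[:, :A]       # X-weights: eigenvectors of M M'
W_z = Vt[:A].T       # Z-weights: eigenvectors of M'M
```

The left and right singular vectors of M are the eigenvectors of MM' and M'M respectively, which is how the "eigenvector" wording on the slide and the SVD view fit together.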

Structure of the L-PLS data
• One interesting aspect of L-PLSR is that the regression model is based on the inherent link between the actual spectra and the theoretical band assignment, giving a direct "chemical" interpretation.
• Naturally, with broad bands like in NIR, the assignments are rather crude, and one should always interpret the results in the light of the chemical background knowledge.
• By applying this procedure in e.g. MIR spectroscopy, a more detailed interpretation than with NIR would be possible.


Selection of variables by CMV
• The 32 marzipan samples were subjected to repeated CMV (100 runs; uncertainty estimates from jack-knifing).
• 3 PLSR components were the basis for significance tests at the 5% level.
• The main results are shown as a map of frequency of significance for the regression coefficient matrix B. Sugar and water constitute the chemical compounds of interest.

Frequency of significance from repeated CMV
[Figure: map of the frequency of significance (color scale 0-100). Vertical axis: wavelength (nm), NIR, 600-2400. Horizontal axis: wavenumber (cm-1), FT-IR MIR, 700-1600. NIR bands marked: OH, CH2 stretch, OH and CH combination. MIR bands marked: O-H deformation (700-800 cm-1), C-O stretch (1030 cm-1), O-H stretch (820 cm-1).]
• The region 1400–1500 nm is related to both O-H stretch vibrations and C-H combinations.
• C-H deformation at 820 cm-1 corresponds to the NIR region around 2200 nm.
• The main peaks in the bands that were significant were selected as input to the L-PLS regression.

Background: selected peaks
[Figure: matrix plot of the band assignment weights (color scale 0.000-1.000). Rows: H2O (O-H), C-H, C-H2, C-H3, R-OH, ArOH, C=C, N-H. Columns: selected NIR wavelengths 984, 1204, 1404, 1434, 1584, 1714, 1864, 1894, 2094, 2154, 2324 nm. Plot label: Euro nir + assign selected - Matrix Plot, Sam.Set: Selected Samples, Var.Set: NIR selected.]

Data structure
[Figure: the three data tables. Matrix plot of the band assignments (rows: H2O (O-H), C-H, C-H2, C-H3, R-OH, ArOH, C=C, N-H; columns: selected NIR wavelengths 984-2324 nm) and two line plots over the 32 samples: the selected NIR variables (984, 1204, 1404, 1434, 1584, 1714, 1864, 1894, 2094, 2154, 2324 nm) and the selected MIR variables (750, 822, 864, 906, 927, 978, 993, 1032, 1065, 1083, 1137 cm-1).]

Correlation loading plot
• Direct interpretation of the chemistry and the actual spectral regions that were found to be significant.
• Note that sugar and water content are inversely correlated in the samples themselves, so there is both a concentration dependence (as % content) and a chemical dependence (in terms of OH bands).
[Figure: correlation loadings (X and Y), PC1 vs. PC2, both axes from -1.0 to 1.0. X-expl: 57%, 33%; Y-expl: 72%, 13%. Plotted items: MIR variables (750-1137 cm-1), NIR variables (984-2324 nm), chemical background groups (H2O (O-H), C-H, C-H2, C-H3, R-OH, ArOH, C=C, N-H, CO-stretch, OH-stretch), composition (Sugar, Moisture) and dummy variables for the samples (1-32).]
Legend:
– MIR wavelengths
– NIR wavelengths
– Chemical background: wavelength table
– Composition
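Correlation loadings are commonly computed as the correlation between each variable and each score vector, which is why every point lies within the unit circle. The sketch below illustrates this with PCA scores on toy data as a stand-in for the L-PLS scores; the function name `correlation_loadings` and the simulated sugar/water values (chosen only to mimic the inverse relation noted above) are assumptions.

```python
import numpy as np

def correlation_loadings(data, scores):
    """Correlation between every column of `data` and every score vector."""
    d = data - data.mean(axis=0)
    t = scores - scores.mean(axis=0)
    num = d.T @ t
    den = np.outer(np.linalg.norm(d, axis=0), np.linalg.norm(t, axis=0))
    return num / den                     # shape: (n_variables, n_components)

# Toy data: water decreases as sugar increases, plus one unrelated variable
rng = np.random.default_rng(0)
sugar = rng.uniform(35, 70, size=32)
water = 6 + 0.3 * (70 - sugar) + rng.normal(scale=0.5, size=32)
other = rng.normal(size=32)
data = np.column_stack([sugar, water, other])

# PCA scores of the autoscaled data (stand-in for the L-PLS scores)
Zs = (data - data.mean(axis=0)) / data.std(axis=0)
U, s, Vt = np.linalg.svd(Zs, full_matrices=False)
scores = U * s

R = correlation_loadings(data, scores[:, :2])
```

In this toy case sugar and water end up on opposite sides of the first component, the same pattern the slide describes for Sugar and Moisture.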

Regression on the selected OH peaks
The –OH peaks to be used to measure the sugar content are: 2094, 2154 and 2324 nm.
[Figure: predicted vs. measured Y for sugar, RESULT10, (Y-var, PC): (Sugar, 2); points labelled by sample number]

Slope      Offset     RMSE       R-Square
0.977228   1.053259   1.913449   0.977228
0.974754   1.177793   2.143689   0.973177

Results may not be better (in this case they are not) but you are able to say why you are using those wavelengths!
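The four statistics reported with a predicted vs. measured plot can be computed as in the sketch below. The function name `fit_statistics` and the toy numbers are illustrative, and taking R-Square as the squared correlation between measured and predicted values is an assumption, since the slides do not state the definition used by the software.

```python
import numpy as np

def fit_statistics(measured, predicted):
    """Slope, offset, RMSE and R-square for a predicted vs. measured plot."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Straight line predicted = slope * measured + offset
    slope, offset = np.polyfit(measured, predicted, 1)
    rmse = np.sqrt(np.mean((predicted - measured) ** 2))
    r2 = np.corrcoef(measured, predicted)[0, 1] ** 2   # ASSUMED definition
    return slope, offset, rmse, r2

# Toy example: every prediction is exactly one unit too high
measured = np.array([35.0, 45.0, 55.0, 65.0])
predicted = measured + 1.0
slope, offset, rmse, r2 = fit_statistics(measured, predicted)
```

A constant bias like this one leaves the slope and R-square untouched and shows up only in the offset and the RMSE, which is why all four numbers are worth reading together.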


Conclusions
• Cross-model validation with jack-knife estimates is an efficient way of removing variables that are not of interest, and the relation between two instrumental methods can be presented as a color image.
• L-PLS regression gives a direct interpretation of the underlying chemistry, which is useful for confirming existing knowledge but also for finding unknown phenomena which may lead to innovation and further research.
• The correlation loading plot is a very condensed way of visualising everything of interest in the three data tables, as well as constituents such as water, sugar and fat.
• The method is general and can be applied to genetics (microarray, SNP), spectroscopy (MIR, NIR, Raman, NMR) and other types of data where the variables have known characteristics from the basic theory, e.g. chemistry and biology.

Perspectives
• L-PLS is suited to other types of data:
– NMR
  Z: information on shift assignment
  Y: NMR measurement
  X: information on the samples or other measurements
– Metabolomics
  Z: information on phenotype
  Y: metabolomic measurement
  X: information on the samples
– Sensory
  Z: information on consumers
  Y: consumer preference
  X: information on the samples

Get value out of your data

Thank you for your attention
Marion Cuny

Other webinars: http://www.camo.com/training/webinars-seminars.html
Recorded webinars: http://www.camo.com/training/archives.html