Presentation at Chimiometrie 2009
Bringing the Chemical Knowledge into Empirical Models
Marion Cuny, CAMO Software AS
[email protected] www.camo.com
Outline
1. Limits of empirical modeling
2. Methodology
   a) NIR and MIR data
   b) Cross-model validation
   c) Background information
   d) LPLS
3. Results
4. Conclusions
Limits of empirical modeling
For many years, chemometrics has focused on new, better methods for classification and regression, and on using variable selection for the optimal classification or the lowest RMSEP.
Examples: iPLS, moving-window iPLS, genetic algorithms, best-combination search algorithms, uncertainty estimates by resampling (bootstrap, jack-knifing)…
32 marzipan samples
• 32 marzipan samples made from 9 recipes were obtained from an industrial batch production.
• Differences in recipes: processing times and different amounts of
– almonds,
– apricot kernels,
– water,
– sucrose,
– invert sugar,
– glucose syrup,
– plus minor contributions of additives in the marzipan masses.
Details about the experimental setup can be found in [1].
• Goal: online measurement of the sugar content, ranging from 35-70%, and of the water content, varying from 6-18%.
[1] J. Christensen, L. Nørgaard, H. Heimdal, J.G. Pedersen, S.B. Engelsen. Rapid Spectroscopic Analysis of Marzipan – Comparative Instrumentation. Journal of Near Infrared Spectroscopy, vol 12 (1), 2004.
Regression results for sugar content
Type of data | Pretreatment    | # LVs | RMSECV (Full) | R2 (Validation, Full CV)
NIR          | No pretreatment | 5     | 1.57          | 0.986
NIR          | SNV             | 4     | 1.43          | 0.988
MIR          | No pretreatment | 8     | 2.01          | 0.976
MIR          | SNV             | 8     | 1.77          | 0.982
Sugar molecule: significant B coefficients
[Figure: sugar molecule and the wavelengths with significant regression (B) coefficients]
These wavelengths cannot all be related to sugar!
How do you deal with highly correlated variables? PLS2 algorithm.
Goals
• To find a good model that can be explained by chemistry
• Include more of the chemistry and biology in the models:
– What is the chemistry in the system that is being observed?
– Combine the available instrumental data to get a better understanding: NIR explained by MIR.
– At the same time, apply the chemical background knowledge to confirm the interpretation from the models.
MIR and NIR spectroscopy of marzipan samples
• FT-IR (Perkin-Elmer System 2000, 1700-600 cm-1) and VIS/NIR (FOSS NIR System 6500 instrument, 400-2500 nm) spectroscopy.
[Figure: MIR spectra (left) and VIS/NIR spectra (right) of marzipan samples]
• MIR spectra are considered to contain the fundamental chemical information.
• NIR spectra contain overtones of the fundamental chemical information.
Preprocessing
• Standard normal variate (SNV): equivalent to centering each spectrum and scaling it to unit variance.

$$x_{ik}^{\mathrm{SNV}} = \frac{x_{ik} - m_i}{s_i}$$

where $m_i$ and $s_i$ are the mean and standard deviation of spectrum $i$.
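The SNV transform can be sketched in a few lines of Python with NumPy. The function name `snv` and the demo values are illustrative, and the sample standard deviation (`ddof=1`) is an assumption, since the slide does not state the divisor.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (one per row)."""
    spectra = np.asarray(spectra, dtype=float)
    m = spectra.mean(axis=1, keepdims=True)          # m_i: mean of spectrum i
    s = spectra.std(axis=1, ddof=1, keepdims=True)   # s_i: std of spectrum i (ddof=1 assumed)
    return (spectra - m) / s

# Two made-up spectra that differ only by a multiplicative factor
demo = np.array([[1.0, 2.0, 3.0, 4.0],
                 [10.0, 20.0, 30.0, 40.0]])
corrected = snv(demo)
```

After the transform the two demo spectra coincide, which is the point of SNV: additive offsets and multiplicative scaling between spectra are removed.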
Selection of variables
• PLS regression between X=MIR and Y=NIR
• Study of the significance of the coefficients, tested by jack-knifing
• Validation method: cross-model validation (CMV)
[Diagram: X and Y data blocks; variables selected from significant coefficients under cross-validation compared with variables selected under cross-model validation]
Cross-model validation (CMV) with variable selection based on jack-knifing
0. Cross-validation on all objects
1. Take out e.g. 10% of the objects
2. Cross-validate the remaining objects
3. Find significant variables
4. Predict the objects that were kept out
5. Estimate RMSE (or explained variance)
6. Repeat 1 - 5 until all objects have been taken out
7. Show frequency of significance for all variables
9. Collect and predict an independent test set!
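The loop in steps 1 to 7 can be sketched as follows. This is a simplified illustration, not the implementation used for the results: it uses a single response (PLS1) instead of the X=MIR, Y=NIR setup, a plain NIPALS routine written for this example, and an approximate threshold `t_crit=2.0` in place of an exact 5% test. All function names, segment counts and the synthetic data are assumptions.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Regression vector of a PLS1 (NIPALS) model for centered X and y."""
    Xr, yr = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)
        t = Xr @ w
        p = Xr.T @ t / (t @ t)
        c = (yr @ t) / (t @ t)
        Xr = Xr - np.outer(t, p)      # deflate X
        yr = yr - c * t               # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def jackknife_significant(X, y, n_components, n_segments=5, t_crit=2.0):
    """Steps 2-3: inner cross-validation with jack-knife significance per variable."""
    b_full = pls1_coefficients(X - X.mean(0), y - y.mean(), n_components)
    squared_dev = np.zeros_like(b_full)
    for segment in np.array_split(np.arange(len(y)), n_segments):
        keep = np.setdiff1d(np.arange(len(y)), segment)
        Xk, yk = X[keep], y[keep]
        b_k = pls1_coefficients(Xk - Xk.mean(0), yk - yk.mean(), n_components)
        squared_dev += (b_k - b_full) ** 2
    std_err = np.sqrt(squared_dev * (n_segments - 1) / n_segments)
    return np.abs(b_full) > t_crit * std_err   # approximate test (assumption)

def cross_model_validation(X, y, n_components=2, n_outer=8, seed=0):
    order = np.random.default_rng(seed).permutation(len(y))
    frequency = np.zeros(X.shape[1])
    press = 0.0
    for held_out in np.array_split(order, n_outer):        # step 1
        train = np.setdiff1d(order, held_out)
        Xt, yt = X[train], y[train]
        selected = jackknife_significant(Xt, yt, n_components)
        if not selected.any():
            selected[:] = True                             # fall back to all variables
        frequency += selected
        a = min(n_components, int(selected.sum()))
        x_mean, y_mean = Xt[:, selected].mean(0), yt.mean()
        b = pls1_coefficients(Xt[:, selected] - x_mean, yt - y_mean, a)
        predicted = (X[held_out][:, selected] - x_mean) @ b + y_mean   # step 4
        press += np.sum((y[held_out] - predicted) ** 2)
    return np.sqrt(press / len(y)), frequency / n_outer    # steps 5 and 7

# Synthetic data: only the first two of ten variables carry information
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=32)
rmse_cmv, frequency = cross_model_validation(X, y)
```

Because the variable selection is redone inside every outer segment, the held-out objects never influence which variables are kept, so `rmse_cmv` is not biased by the selection itself. `frequency` holds, for each variable, the fraction of outer segments in which it was found significant.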
Band assignment table
• 8 group frequencies were assigned to their respective vibrational regions from an NIR chart, and an initial binary matrix was built using 1 to indicate peaks and 0 elsewhere. In addition, peaks were weighted as weak (0.5), normal (1), strong (2) or very strong (4).
• The band assignment matrix was then convolved with a Gaussian filter of size N=30 (corresponding to 60 nm when the spectral resolution is equal to 2 nm) for each group.
[Figure: band assignment matrix, functional group versus wavelength (nm), with the sugar molecule and its expected wavelengths]
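The construction of the band assignment matrix can be illustrated with NumPy. The peak positions and weights below are hypothetical placeholders, not the values read from the NIR chart, and the Gaussian width `sigma` is an assumption because the slide gives only the filter size N=30.

```python
import numpy as np

wavelengths = np.arange(400, 2500, 2)   # nm, 2 nm resolution as stated on the slide

# Hypothetical (position in nm, weight) pairs for two groups; illustrative only
peaks = {
    "O-H": [(1440, 2.0), (1940, 4.0)],
    "C-H": [(1210, 1.0), (1720, 0.5)],
}

def gaussian_window(size=30, sigma=5.0):
    """Gaussian filter of the given size, scaled so its maximum is 1 (sigma assumed)."""
    n = np.arange(size) - (size - 1) / 2
    g = np.exp(-0.5 * (n / sigma) ** 2)
    return g / g.max()

def band_assignment_matrix(wavelengths, peaks, size=30, sigma=5.0):
    """One row per functional group: weighted peak markers smoothed by a Gaussian."""
    window = gaussian_window(size, sigma)
    Z = np.zeros((len(peaks), len(wavelengths)))
    for row, assignments in enumerate(peaks.values()):
        for position, weight in assignments:
            nearest = np.argmin(np.abs(wavelengths - position))
            Z[row, nearest] = weight                  # weighted peak marker, 0 elsewhere
        Z[row] = np.convolve(Z[row], window, mode="same")
    return Z

Z = band_assignment_matrix(wavelengths, peaks)
```

Each row of `Z` is zero except for a 30-point (60 nm) Gaussian bump around every assigned peak, with the bump height equal to the peak weight.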
L-shape PLS regression
• Regular two-block PLS: X-weights (used to calculate scores and loadings) can be obtained from the first A eigenvectors of: [equation on slide]
• Three-block / L-PLS: weights for X and Z (used to find X and Z scores and loadings) may be obtained from the first A eigenvectors of: [equation on slide]
H. Martens, E. Anderssen, A. Flatberg, L.H. Gidskehaug, M. Høy, F. Westad, A. Thybo, M. Martens. Regressing a matrix on descriptors of both its rows and of its columns, by low-rank L-PLS Regression. Computational Statistics and Data Analysis, 48, 103-125, 2005.
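As a rough numerical sketch of the two cases, the snippet below assumes the standard formulations: for two-block PLS the X-weights are eigenvectors of X'YY'X (equivalently, left singular vectors of X'Y), and for L-PLS the X- and Z-weights are the left and right singular vectors of X'YZ, as in the single-SVD variant described in the cited paper. The exact expressions shown on the slide are not reproduced here; the block sizes, the random data and the centering choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_x, n_y, n_z = 32, 11, 11, 8      # hypothetical block sizes
X = rng.normal(size=(n_samples, n_x))         # row descriptors of Y (e.g. MIR)
Y = rng.normal(size=(n_samples, n_y))         # central block (e.g. NIR)
Z = rng.normal(size=(n_y, n_z))               # column descriptors of Y (band assignments)

Xc = X - X.mean(axis=0)
Zc = Z - Z.mean(axis=0)
# Double-centering of Y is an assumption made for this sketch
Yc = Y - Y.mean(axis=0) - Y.mean(axis=1, keepdims=True) + Y.mean()

A = 2  # number of components

# Two-block PLS: left singular vectors of X'Y are eigenvectors of X'YY'X
U2, s2, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
w_pls = U2[:, :A]

# L-PLS (single SVD, no deflation): SVD of X'YZ gives X- and Z-weights
U, s, Vt = np.linalg.svd(Xc.T @ Yc @ Zc, full_matrices=False)
Wx, Wz = U[:, :A], Vt[:A].T
Tx = Xc @ Wx      # X scores, one row per sample
Tz = Zc @ Wz      # Z scores, one row per Y variable
```

The paper also describes a component-wise variant with deflation between components, which this sketch does not cover.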
Structure of the L-PLS data
• One interesting aspect of L-PLSR is that the regression model is based on the inherent link between the actual spectra and the theoretical band assignment, giving direct chemical interpretation. Naturally, with broad bands like in NIR, the assignments are rather crude, and one should always interpret the results in the light of the chemical background knowledge. By applying this procedure in e.g. MIR spectroscopy, a more detailed interpretation compared to NIR would then be possible.
Selection of variables by CMV
• The 32 marzipan samples were subjected to repeated CMV (100 runs), with uncertainty estimates from jack-knifing.
• 3 PLSR components were the basis for significance tests at the 5% level.
• The main results are shown as a map of frequency of significance for the regression coefficient matrix B. Sugar and water constitute the chemical compounds of interest.
Frequency of significance from repeated CMV
[Figure: map of frequency of significance (color scale 0-100) from repeated CMV. Vertical axis: wavelength (nm), NIR, 600-2400. Horizontal axis: wavenumber (cm-1), FT-IR MIR, 700-1600. NIR annotations: OH, CH2 stretch, OH and CH combination, OH. MIR annotations: O-H deformation 700-800 cm-1, C-O stretch 1030 cm-1, O-H stretch 820 cm-1.]
The region 1400–1500 nm is related to both O-H stretch vibrations and C-H combinations.
C-H deformation at 820 cm−1 corresponds to the NIR region around 2200 nm.
The main peaks in the bands that were significant were selected as input to the L-PLS regression.
Background: selected peaks
[Figure: matrix plot of the band assignment table for the selected NIR variables (color scale 0 to 1). Rows: H2O (O-H), C-H, C-H2, C-H3, R-OH, ArOH, C=C, N-H. Columns (nm): 2324, 2154, 2094, 1894, 1864, 1714, 1584, 1434, 1404, 1204, 984.]
Data structure
[Figure: the three data tables. Matrix plot of the band assignment table (H2O (O-H), C-H, C-H2, C-H3, R-OH, ArOH, C=C, N-H) over the selected NIR wavelengths; line plot of the selected NIR variables (2324, 2154, 2094, 1894, 1864, 1714, 1584, 1434, 1404, 1204, 984 nm) for samples 1-32; line plot of the selected MIR variables (750, 822, 864, 906, 927, 978, 993, 1032, 1065, 1083, 1137 cm-1) for samples 1-32.]
Correlation loading plot
• Direct interpretation of the chemistry and the actual spectral regions that were found to be significant.
• Note that sugar and water content are inversely correlated in the samples themselves, so there is both a concentration dependence (% content) and a chemical dependence in terms of OH bands.
[Figure: correlation loadings (X and Y), PC1 versus PC2, both axes from -1.0 to 1.0. X-expl: 57%, 33%; Y-expl: 72%, 13%. Legend: MIR wavenumbers, NIR wavelengths, chemical background table (functional groups), composition (sugar, moisture). Samples 1-32 are shown as dummy variables. Annotated regions: CO-stretch, OH-stretch.]
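Correlation loadings are the correlations between each variable and each score vector, which is why every point in such a plot lies within the unit circle. A minimal sketch follows; the function name and the random demo data are illustrative, and PCA scores from an SVD stand in for the L-PLS scores.

```python
import numpy as np

def correlation_loadings(data, scores):
    """Correlation between every column of `data` and every column of `scores`."""
    D = data - data.mean(axis=0)
    T = scores - scores.mean(axis=0)
    D = D / np.linalg.norm(D, axis=0)
    T = T / np.linalg.norm(T, axis=0)
    return D.T @ T        # shape: (n_variables, n_components)

# Demo: PCA scores of random data (stand-in for model scores)
rng = np.random.default_rng(0)
data = rng.normal(size=(32, 5))
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U[:, :2] * s[:2]                 # scores on PC1 and PC2
R = correlation_loadings(data, scores)
```

Each row of `R` gives the coordinates of one variable in the PC1/PC2 correlation loading plot. The same function can be applied to any table whose rows match the scores, which is how variables from several tables end up in one plot.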
Regression on the selected OH peaks
The –OH peaks to be used to measure the sugar content are: 2094, 2154, 2324 nm.
[Figure: predicted Y versus measured Y for sugar (2 PCs), samples 1-32]
            | Slope    | Offset   | RMSE     | R-Square
Calibration | 0.977228 | 1.053259 | 1.913449 | 0.977228
Validation  | 0.974754 | 1.177793 | 2.143689 | 0.973177
Results may not be better (in this case they are not), but you are able to say why you are using those wavelengths!
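The four statistics of a predicted-versus-measured plot can be computed as sketched below. The function name and the demo values are hypothetical (they are not the marzipan data), and R-Square is taken here as the squared Pearson correlation, which is an assumption about how the software defines it.

```python
import numpy as np

def predicted_vs_measured_stats(measured, predicted):
    """Slope and offset of the fitted line, RMSE, and squared correlation."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    slope, offset = np.polyfit(measured, predicted, 1)   # predicted = slope * measured + offset
    rmse = np.sqrt(np.mean((predicted - measured) ** 2))
    r_square = np.corrcoef(measured, predicted)[0, 1] ** 2
    return slope, offset, rmse, r_square

# Hypothetical sugar contents (%), for illustration only
measured = np.array([35.0, 42.0, 48.0, 55.0, 61.0, 70.0])
predicted = np.array([36.5, 41.0, 49.5, 53.0, 62.0, 68.5])
slope, offset, rmse, r_square = predicted_vs_measured_stats(measured, predicted)
```

For a least-squares calibration fit, the slope of predicted versus measured equals R-Square, which matches the first row of the statistics above.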
Conclusions
• Cross-model validation with jack-knife estimates is an efficient way of removing variables that are not of interest, and the relation between two instrumental methods can be presented as a color image.
• L-PLS regression gives direct interpretation of the underlying chemistry, which is useful for confirming existing knowledge but also for finding unknown phenomena which may lead to innovation and further research.
• The correlation loading plot is a very condensed way of visualising everything of interest for the three data tables as well as constituents such as water, sugar, fat.
• The method is general and can be applied to genetics (microarray, SNP), spectroscopy (MIR, NIR, Raman, NMR), and other types of data where the variables have known characteristics from the basic theory, e.g. chemistry and biology.
Perspectives
• L-PLS is suited to other types of data:
– NMR. Z: information on shift assignment; Y: NMR measurement; X: information on the samples or other measurements
– Metabolomics. Z: information on phenotype; Y: metabolomic measurement; X: information on the samples
– Sensory. Z: information on consumer; Y: consumer preference; X: information on the samples
Get value out of your data
Thank you for your attention Marion Cuny
[email protected]
Other webinars: http://www.camo.com/training/webinars-seminars.html
Recorded webinars: http://www.camo.com/training/archives.html