28th-29th November 2007
Discrimination of balsamic vinegar production methods: Evolving Window Zone Selection and Interval PLS-Cluster on 1H NMR spectra, followed by ICA
Marion Cuny1,2, Delphine Bouveresse3, Michèle Lees1, Douglas Rutledge2
1: Eurofins Scientific Analytics, Nantes, FR
2: AgroParisTech, Paris, FR
3: INRA, Paris, FR
www.eurofins.com
Traditional Balsamic Vinegar (TB) vs. Balsamic Vinegar (Ba)
2 areas (PDO):
• Modena
• Reggio Emilia
Lyon, 29th November 2007
Traditional Balsamic Vinegar (TB) vs. Balsamic Vinegar (Ba)
Production:
TB: cooked must from Trebbiano grape varieties
Ba: cooked must diluted with wine and caramel (2%)
Aging:
TB: aged at least 12 years in barrels of different kinds of wood and of decreasing volume
Ba: 60 days in wooden containers
Price variations leading to counterfeiting
Need for authentication methods
1H NMR spectroscopy
Simplified sample preparation
Rapid measurement
Screening of all protonated molecules
Presentation outline
1. Samples and data sets
2. Chemometric methods
3. Discrimination following Evolving Window Zone Selection
4. Discrimination following Interval-PLS_Cluster
5. Conclusion
1. Samples and data sets
Samples and spectrum acquisition
Samples: 15 TB / 14 Ba
Ba: 7 from the market and 7 from producers in Emilia. 3 are aged.
Sample preparation: Dipartimento di Chimica Organica e Industriale, Università di Parma. Dilution of vinegar in D2O + TSP
Data acquisition: noesypr1d, NS = 32, AQ = 1.4 sec
Experiment duration: 3.5 min
Data pre-treatment
Phase correction
Baseline correction
Warping
Mean of 7 adjacent points: 32 K → 4692 variables
Edge, water and TSP suppression: 4692 → 3539 variables
Log transformation
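The last three pre-treatment steps can be sketched as follows. This is only an illustration on a toy list of intensities: the function names, the index regions that are dropped and the `offset` added before taking the logarithm are assumptions, since the slides do not give these details.

```python
import math

def bin_mean(spectrum, width=7):
    """Average each run of `width` adjacent points (an incomplete tail is dropped)."""
    n_bins = len(spectrum) // width
    return [sum(spectrum[i * width:(i + 1) * width]) / width for i in range(n_bins)]

def drop_regions(spectrum, regions):
    """Remove the points whose index falls in any half-open (start, stop) region."""
    return [v for i, v in enumerate(spectrum)
            if not any(start <= i < stop for start, stop in regions)]

def log_transform(spectrum, offset=1.0):
    """Log-transform intensities; `offset` is a hypothetical guard against log(0)."""
    return [math.log(v + offset) for v in spectrum]

raw = [float(i % 10) for i in range(70)]         # toy "spectrum" of 70 points
binned = bin_mean(raw, width=7)                  # 70 points -> 10 variables
kept = drop_regions(binned, [(0, 2), (8, 10)])   # hypothetical edge regions
pretreated = log_transform(kept)
```

In the real data the suppressed regions would be the spectrum edges, the water signal and the TSP signal.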
Data set
An X(n,p) matrix of data containing the values used to trace the spectra

X(n,p)      Variable 1   Variable 2   …   Variable p
Sample 1    x1,1         x1,2         …   x1,p
Sample 2    x2,1         x2,2         …   x2,p
…
Sample n    xn,1         xn,2         …   xn,p

[Figure: recorded spectra, intensity (x 10^8) against chemical shift (ppm)]
Data set after pre-treatment
[Figure: complete dataset after pre-treatment]
2. Chemometric methods
Why use variable selection?
[Figure: vinegar spectrum (10 to 0 ppm) with annotated regions: aromatic compounds, sugars, acids]
The relevant information may be anywhere in the spectrum
No part of it should be neglected
Methodology
[Figure: vinegar spectrum (10 to 0 ppm)]
Variable selection (EWZS or Interval-PLS_Cluster) → ICA → Prediction model
Selection of info: EWZS function
Dataset = Xn,2500
Value to predict = Yn,1
Evolving Window Zone Selection function: a growing sliding window to test each zone's ability to predict Y
Legend: Step = 500, Minimal size = 500, Maximal size = 1500
[Figure: spectrum (variables 0 to 2500) with the observation window]
1st zone tested: Xn,1:500
In press: Cuny, M. et al., Evolving Window Zone Selection method followed by Independent Component Analysis as useful chemometric tools to discriminate between grapefruit juice, orange juice and blends, ACA, 2007
Selection of info: EWZS function
[Figure: spectrum (variables 0 to 2500) with the observation window at its first and second sizes]
Zones tested so far: Xn,1:500, Xn,1:1000
2nd zone tested: Xn,1:1000
Selection of info: EWZS function
Growing: for a given start, the window size increases by one step up to the maximal size
Sliding: the start of the window then moves by one step
Xn,1:500       Xn,1:1000      Xn,1:1500
Xn,500:1000    Xn,500:1500    Xn,500:2000
Xn,1000:1500   Xn,1000:2000   Xn,1000:2500
…
[Figure: spectrum (variables 0 to 2500) with the observation window drawn for each zone tested]
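The growing and sliding scheme can be sketched as a small generator of zones. This is a minimal illustration, not the authors' implementation: the function name `ewzs_windows` is hypothetical, and it uses 0-based half-open indices (0:500, 500:1000) where the slides write 1:500, 500:1000.

```python
def ewzs_windows(n_vars, step=500, min_size=500, max_size=1500):
    """List (start, stop) zones: for each start the window grows by `step`
    from `min_size` up to `max_size`; the start then slides by `step`."""
    windows = []
    for start in range(0, n_vars, step):                   # sliding
        for size in range(min_size, max_size + 1, step):   # growing
            stop = start + size
            if stop <= n_vars:                             # stay inside the spectrum
                windows.append((start, stop))
    return windows

zones = ewzs_windows(2500)
```

With the legend values (step 500, minimal size 500, maximal size 1500) and 2500 variables, the first zones are (0, 500), (0, 1000) and (0, 1500), matching the order shown on the slides.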
Selection of info: EWZS function
Zones tested: Xn,1:500, Xn,1:1000, Xn,1:1500, Xn,500:1000, Xn,500:1500, Xn,500:2000, Xn,1000:1500, Xn,1000:2000, Xn,1000:2500, …
For each zone:
Data reduction: ICA, PCA or PLS
Prediction of Yn,1: PLS, PLS-DA or linear regression
Criterion: RMSECV, R²
This gives RMSECV(Xn,1:500), RMSECV(Xn,1:1000), …, RMSECV(Xn,1000:2500), … and R²(Xn,1:500), R²(Xn,1:1000), …, R²(Xn,1000:2500), …
[Figure: map of the RMSECV values and map of the R² values]
Selection: selected zone = Xn,500:1500 …
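The scoring of each zone can be sketched as below. The slides list several choices for data reduction and prediction; this sketch assumes one of them only, a PLS regression with a single latent variable, scored by leave-one-out cross-validation on toy data. The function names and the toy data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y):
    """One-latent-variable PLS1 on centered data; returns the coefficients b."""
    w = X.T @ y
    w /= np.linalg.norm(w)          # weight vector
    t = X @ w                       # scores
    q = (t @ y) / (t @ t)           # regression of y on the scores
    return w * q

def rmsecv_r2(X, y):
    """Leave-one-out RMSECV and cross-validated R2 of the 1-LV model."""
    n = len(y)
    pred = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        b = pls1_fit(X[train] - x_mean, y[train] - y_mean)
        pred[i] = (X[i] - x_mean) @ b + y_mean
    press = np.sum((y - pred) ** 2)
    return np.sqrt(press / n), 1.0 - press / np.sum((y - y.mean()) ** 2)

def best_zone(X, y, windows):
    """Return the (start, stop) window with the lowest RMSECV, and all scores."""
    scores = {w: rmsecv_r2(X[:, w[0]:w[1]], y) for w in windows}
    return min(scores, key=lambda w: scores[w][0]), scores

rng = np.random.default_rng(0)
latent = rng.normal(size=30)
X = rng.normal(size=(30, 40))
X[:, 10:20] += 3.0 * latent[:, None]   # toy data: variables 10..19 carry the signal
y = latent.copy()
windows = [(s, s + size) for s in range(0, 40, 10)
           for size in (10, 20, 30) if s + size <= 40]
zone, scores = best_zone(X, y, windows)
```

Here the selection keeps the zone with the lowest RMSECV; a selection on the highest R² would use the second element of each score instead.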
Interval-PLS_Cluster
1) Apply the generalised PLS_Cluster algorithm across the data set within a moving window
2) A dendrogram is obtained for each window
3) Selection of the “best” dendrograms gives a selection of informative intervals
Generalised PLS_Cluster (1)
Recursive method based on PLS
The data set is first separated into C1 groups
Each group is then split into C2i sub-groups, etc.
Iterated until all sub-groups are singletons
The number of clusters Ci is not fixed:
 - determined at each node of the dendrogram
 - based on the data structure
Generalised PLS_Cluster (2)
Original data are in X(n x p)
1. Set the first y vector (n x 1) to y = u1 (normed PC1 score vector of column-centered X).
2. Membership_function (y)
3. Calculate a 1-LV PLS model between X and y
   → regression coefficients b (p x 1)
   → fitted ŷ = X b
4. If ŷ has not stabilized
   → replace y by ŷ
   → go to step 3
Determination of C
After convergence (before applying the membership function):
 - the ŷ elements are sorted by increasing value
 - the difference between consecutive elements is calculated
[Figure: sorted ŷ values (0 to 1) and differences between consecutive elements (0 to 0.25) with the threshold]
Depending on the threshold value, two or three clusters are detected here.
The threshold is usually set to 3 * median.
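The gap rule can be sketched in a few lines. This sketch assumes that "3 * median" means three times the median of the consecutive differences, and that the number of clusters C is the number of gaps above the threshold plus one; the function name and the ŷ values are made up for the illustration.

```python
from statistics import median

def count_clusters(y_hat, factor=3.0):
    """Sort the fitted values, take consecutive differences and count the gaps
    larger than `factor` * median(differences); C = number of gaps + 1."""
    ordered = sorted(y_hat)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    threshold = factor * median(gaps)
    return 1 + sum(g > threshold for g in gaps), threshold

# Hypothetical fitted values forming three groups
y_hat = [0.05, 0.07, 0.10, 0.12, 0.48, 0.50, 0.53, 0.90, 0.92, 0.95]
n_clusters, threshold = count_clusters(y_hat)
```

A larger `factor` makes the rule more conservative, which is how the same ŷ vector can give two or three clusters depending on the threshold.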
Result on a particular interval
[Figure: spectra (variables 0 to 4000) and the dendrogram obtained on one interval (axis: Level)]
Independent Component Analysis
Aim: recover the “pure” sources from mixed signals
Based on the assumption of statistical independence of the underlying “pure” sources
How? By finding a demixing transformation that minimises dependencies among the estimates of the “pure” sources
A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, Wiley, New York, 2001.
Classification with ICA
1) An ICA model on a matrix X(n-1,p) of n-1 individuals and p variables, using m components, gives:
 - the m signals (loadings): S(p,m)
 - the new coordinates of the n-1 individuals in the new base: P(n-1,m)
2) Projection of a new sample J(1,p) on the m signals: B(1,m) = J(1,p) * S(p,m) * inv(S(p,m)' * S(p,m))
3) Calculation of the Mahalanobis distance to the barycenters of the groups
4) Classification in the nearest group
5) Rate of correct classification
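Steps 2 to 4 can be sketched with NumPy as follows. The ICA fit itself is not shown: the signals S and the coordinates P are taken as given and simulated here. The function names, the toy data and the choice of one covariance matrix per group for the Mahalanobis distance are assumptions.

```python
import numpy as np

def project(J, S):
    """B(1,m) = J(1,p) * S(p,m) * inv(S' * S): coordinates of J on the m signals."""
    return J @ S @ np.linalg.inv(S.T @ S)

def nearest_group(b, P, labels):
    """Assign b to the group whose barycenter is closest in Mahalanobis distance.
    Assumption: one covariance matrix is estimated per group."""
    distances = {}
    for g in sorted(set(labels)):
        scores = P[np.array(labels) == g]
        diff = b - scores.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))
        distances[g] = float(np.sqrt(diff @ cov_inv @ diff))
    return min(distances, key=distances.get), distances

rng = np.random.default_rng(1)
S = rng.normal(size=(50, 3))                  # hypothetical signals, p = 50, m = 3
labels = ["TB"] * 15 + ["Ba"] * 13            # n-1 = 28 training samples
centers = {"TB": np.array([-1.0, 0.5, 0.0]), "Ba": np.array([1.0, -0.5, 0.0])}
P = np.array([centers[g] for g in labels]) + 0.1 * rng.normal(size=(28, 3))
b_true = centers["Ba"]                        # left-out sample, simulated as a Ba vinegar
J = b_true @ S.T                              # its simulated spectrum, shape (p,)
b = project(J, S)
group, distances = nearest_group(b, P, labels)
```

Repeating this for each left-out sample and counting the correct assignments gives the rate of correct classification of step 5.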
3. Discrimination following Evolving Window Zone Selection
Results of EWZS function
[Figure: map of the maximum regression R² (colour scale 0.1 to 0.7) for the zones tested]
Selected zones:
Zone     start        end
Zone 1   690 or 590   640
Zone 2   955          990
Zone 3   1310         1340
Zone 4   1890         1930
Zone 5   1940         1980
Zone 6   2300         2360
Zone 7   2960         3130
Corresponding zones
[Figure: balsamic and traditional balsamic spectra zoomed on the seven zones (ppm): 8.34 to 8.22, 7.58 to 7.50, 6.70 to 6.64, 5.28 to 5.20, 5.16 to 5.06, 4.00 to 3.90, 2.25 to 1.95. Annotations: formic acid, HMF, α-glucose, glucose-fructose, acetic acid, and one unassigned signal (?)]
ICA model on the selection: 3 components
[Figure: IC 1, IC 2 and IC 3 plotted over the selected zones (ppm)]
IC1 looks like the signal of -TB
Discrimination of samples on IC1
[Figure: score plot of the Ba and TB samples, IC1 (x-axis, -1.1 to -0.2) against IC2 (y-axis, 0.5 to 1.2)]
All samples correctly classified by leave-one-out cross-validation (the 3 LVs are taken into account)
4. Discrimination following Interval-PLS_Cluster
Reference PLS-Cluster
[Figure: spectral zone, full spectrum (10 to 0 ppm). In green: TB and in red: Ba]
[Figure: dendrogram of the samples]
Interval PLS-Cluster
Parameters: window size: 80 points, step: 20 points
Selected zones
17 zones selected; some are contiguous → 9 zones
[Figure: spectra of the selected zones (ppm). Ba in black, TB in red. Annotations: Tyr, sugar, β-glu, citric and malic acids, succinic acid, and "Same as EWZS selection"]
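The moving window and the merging of contiguous selected zones can be sketched as below. The window size and step come from the parameters slide; applying them to the 3539 variables kept after pre-treatment is an assumption, and the indices of the selected windows are invented for the illustration.

```python
def moving_windows(n_vars, size=80, step=20):
    """(start, stop) index pairs of a fixed-size window moved by `step` points."""
    return [(s, s + size) for s in range(0, n_vars - size + 1, step)]

def merge_contiguous(zones):
    """Merge zones that overlap or touch into single (start, stop) zones."""
    merged = []
    for start, stop in sorted(zones):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged

windows = moving_windows(3539)                              # assumed number of variables
selected = [windows[i] for i in (10, 11, 12, 40, 90, 91)]   # hypothetical selection
zones = merge_contiguous(selected)
```

In this toy case six selected windows collapse into three zones, in the same way that the 17 selected windows reduce to 9 zones.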
Clustering on the selected zone: Zone 1
[Figure: spectral zone (8.35 to 8.05 ppm) of the TB and Ba samples. Annotations: formic acid and one unassigned signal (?)]
[Figure: dendrogram of the samples. Annotated groups: aged 6 years, from Emilia, from the market]
Discrimination of the samples on IC1 and IC2
[Figure: score plot of the Ba and TB samples, IC1 (x-axis, -3500 to -500) against IC2 (y-axis, 500 to 3000)]
All samples correctly classified by leave-one-out cross-validation (the 3 components are taken into account)
5. Conclusion
A procedure combining variable selection (EWZS or Interval-PLS_Cluster) and ICA applied to 1H NMR spectra was developed to discriminate more expensive traditional balsamic vinegar from industrial balsamic vinegar.
The zones detected by EWZS correspond to compounds that are related to the aging process: volatility (acetic acid), degradation (HMF), etc.
The zones detected by Interval-PLS_Cluster were also related to aging processes, but also to other mechanisms such as sugar modification.
Due to its non-supervised nature, Interval-PLS_Cluster is able to detect different clustering factors. In the present case, it seems to have detected subpopulations within the two vinegar groups.
Acknowledgment
For their help, advice and time: my colleagues at AgroParisTech and Eurofins
For financial support: Eurofins; French Ministry of Research, support of the CIFRE convention n° 169-2005
Thank you for your attention