How to format your data for importing into SIMCA-P - Umetrics
How to format your data<br />
for importing into SIMCA-P<br />
Headquarters:<br />
Umetrics AB<br />
Box 7960<br />
SE-90719 Umeå<br />
Sweden<br />
Phone: +46 (0)90 184800<br />
Fax: +46 (0)90 184899<br />
Email: info.se@umetrics.com<br />
Content<br />
General points<br />
1. QSAR<br />
2. Genomics / Proteomics / Metabonomics<br />
3. Spectroscopic data<br />
4. Process data (or biological time courses)<br />
5. LC-MS or GC-MS data<br />
6. Chromatography & Electrophoresis data<br />
Appendix: Detailed batch data description<br />
GUIDING THE WAY – USING SIMCA-P<br />
www.umetrics.com<br />
European Sales Offices:<br />
Umetrics AB<br />
Stortorget 21<br />
SE-211 34 Malmö<br />
Sweden<br />
Phone: +46 (0)40 6642580<br />
Fax: +46 (0)40 6642585<br />
Email: info.se@umetrics.com<br />
Umetrics UK Ltd.<br />
Woodside House,<br />
Winkfield, Windsor<br />
Berkshire, SL4 2DX, UK<br />
Phone: +44 (0)1344 885605<br />
Fax: +44 (0)1344 885410<br />
Email: info.uk@umetrics.com<br />
The Standard in Multivariate Data Analysis<br />
SIMCA-P and SIMCA-P+<br />
North America:<br />
Umetrics Inc.<br />
17 Kiel Ave.<br />
Kinnelon NJ 07405<br />
USA<br />
Phone: +1 973 492 8355<br />
Fax: +1 973 492 8359<br />
Email: info.us@umetrics.com<br />
IID 2029 2005
General points<br />
• Data can be compiled in Excel or as comma separated values (.csv)<br />
• Data may be held in Access database files<br />
• If your data has replicates it is usually best to include these rather than averaging them.<br />
• Process data should be collected at the same time points.<br />
• Small amounts of missing data can be tolerated.<br />
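The points about replicates and missing data can be sketched in Python as follows. This is only an illustration: the sample names and values are hypothetical, and writing a missing value as an empty cell is an assumption about how the file will be read, not a documented SIMCA-P rule.<br />

```python
import csv
import io

# Hypothetical replicate measurements; None marks a missing value.
observations = [
    ("Sample 1 rep 1", [1.02, 3.40, 0.55]),
    ("Sample 1 rep 2", [0.98, None, 0.57]),
    ("Sample 2 rep 1", [1.40, 2.90, 0.61]),
    ("Sample 2 rep 2", [1.37, 3.05, 0.60]),
]

def missing_fraction(rows):
    """Fraction of cells without a value across all observations."""
    cells = [v for _, values in rows for v in values]
    return sum(v is None for v in cells) / len(cells)

def to_csv(rows, header):
    """One line per replicate (no averaging); a missing value becomes an empty cell."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for name, values in rows:
        writer.writerow([name] + ["" if v is None else v for v in values])
    return buffer.getvalue()

csv_text = to_csv(observations, ["ID", "Var1", "Var2", "Var3"])
fraction = missing_fraction(observations)
```

Each replicate keeps its own row, and `fraction` gives a quick check that only a small share of the cells is empty.<br />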
1. QSAR<br />
Molecular or Physical Properties Activity data<br />
Var1 Var2 Var3 Var4 Act 1 Act 2 Act 3<br />
Compound 1<br />
Compound 2<br />
Compound 3<br />
2. Genomics / Proteomics / Metabonomics<br />
If you have more than 256 gene / protein / NMR columns and you are using Excel, you will initially have to format the table with the gene data arranged vertically. The data can then be transposed during SIMCA-P import. Otherwise use comma separated value files (.CSV).<br />
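As a rough illustration of the vertical arrangement, the Python sketch below holds a table with one gene per row and derives the transposed layout with one gene per column. The function name `transpose_table`, the gene and dose labels and the values are all hypothetical; SIMCA-P performs its own transposition at import, so this only shows what the two layouts look like.<br />

```python
import csv
import io

def transpose_table(rows):
    """Swap rows and columns of a rectangular table (list of lists)."""
    if any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("all rows must have the same number of cells")
    return [list(col) for col in zip(*rows)]

# Hypothetical data: genes arranged vertically, one column per dose.
vertical = [
    ["Gene", "Dose 1", "Dose 2"],
    ["Gene A", "0.12", "0.45"],
    ["Gene B", "0.23", "0.34"],
    ["Gene C", "0.45", "0.12"],
]

# Transposed layout: one row per dose, one column per gene.
horizontal = transpose_table(vertical)

# Write the vertical layout as comma separated values.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(vertical)
vertical_csv = buffer.getvalue()
```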
[Table layout: one row per dose (Dose 1, Dose 2, Dose 3 ... Dose n); Gene / Protein / Peak columns; cells hold fluorescence data.]<br />
[Table layout for several drugs: one row per drug and dose (Drug 1 Dose 1, Drug 1 Dose 2, Drug 2 Dose 1, Drug 2 Dose 2 ... Dose n); Gene / Protein / Peak columns; cells hold fluorescence data.]<br />
3. Spectroscopic data<br />
Most spectroscopic instruments can save data in one of the following formats that SIMCA-P supports: JCAMP-DX, MVACDF, Brimrose files, Galactic SPC files, NSAS files. Saving the data in one of these formats should be your first choice.<br />
Spectral data with fewer than 256 variables (wavelengths) can also be saved in Excel as below.<br />
[Table layout: one row per spectrum (Spectra 1, Spectra 2 ... Spectra n); the columns hold the spectroscopy data.]<br />
For spectra with more than 256 variables (wavelengths), format the table with the spectral data going down in columns to overcome the limit of 256 columns in Excel. The data is then transposed during SIMCA-P import.<br />
Y-var
Example: NIR spectra of raw material:<br />
Spectrum Batch 1 Spectrum Batch 2 Spectrum Batch n<br />
Wavelength Absorbance Absorbance Absorbance<br />
210 0.12 0.45 0.34<br />
211 0.45 0.12 0.23<br />
212….etc 0.23 0.23 0.12<br />
Example: Comparing spectra of different compounds:<br />
Compound 1 Compound 2 Compound n<br />
Wavelength Absorbance Absorbance Absorbance<br />
210 0.12 0.45 0.34<br />
211 0.45 0.12 0.23<br />
212….etc 0.23 0.23 0.12<br />
Property or activity 1<br />
Property or activity 2<br />
4. Process data (or biological time courses)<br />
Monitoring a process:<br />
[Table layout: one row per time point (Time 1, Time 2, Time 3); one column per variable, identified by a tag description (ex. pressure pos2) and a tag name (ex. FFC2-p002-xx01).]<br />
Equally spaced time samples are preferable for a continuous process. Each measured variable should preferably be identified in two or more cells of its column: one for an interpretable name and one for the technical tag name in the database.<br />
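A minimal Python sketch of this layout is given below, assuming two identifier rows per column followed by one row per time point. The second tag (`FFC2-t002-xx01`), the readings and the helper names `equally_spaced` and `build_table` are hypothetical and only serve the illustration.<br />

```python
import csv
import io

# (interpretable description, technical tag name); the second tag is hypothetical.
tags = [
    ("pressure pos2", "FFC2-p002-xx01"),
    ("temperature pos2", "FFC2-t002-xx01"),
]

# Hypothetical readings: (time in minutes, one value per tag).
readings = [
    (0, [1.01, 75.2]),
    (5, [1.03, 75.4]),
    (10, [1.02, 75.1]),
]

def equally_spaced(times):
    """True if consecutive time points are separated by one constant step."""
    steps = {b - a for a, b in zip(times, times[1:])}
    return len(steps) <= 1

def build_table(tags, readings):
    """Two identifier rows per column, followed by one row per time point."""
    rows = [
        ["Tag description"] + [description for description, _ in tags],
        ["Tag name"] + [name for _, name in tags],
    ]
    rows += [[time] + values for time, values in readings]
    return rows

table = build_table(tags, readings)
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(table)
process_csv = buffer.getvalue()
spacing_ok = equally_spaced([time for time, _ in readings])
```

Integer time stamps are used here so that the spacing check is exact; with fractional times a small tolerance would be needed.<br />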
Modelling a process output:<br />
[Table layout: one row per time point (Time 1, Time 2 ... Time n); input measurement columns followed by quality or output measurement columns.]<br />
For batch data (for metabonomics data each batch is an individual):<br />
[Table layout: the rows for Batch 1 are followed by the rows for Batch 2; each batch has its own Time entries, and the variables are in columns.]<br />
See the Appendix for a detailed description of how to format batch data.<br />
5. LC-MS or GC-MS data<br />
LC-MS or GC-MS data takes the form of a three-dimensional data cube. To format this data, unfold it so that each variable becomes a time:mass paired variable. Software such as Marker Lynx from Waters and Metalign from Plant Research International can be used for this.<br />
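The unfolding step can be sketched in Python as below. This is a bare reshaping under simplifying assumptions: the cube, its axis values and the function name `unfold` are hypothetical, and dedicated software also performs peak detection and alignment, which this sketch does not attempt.<br />

```python
# Hypothetical cube: cube[sample][time_index][mass_index] = intensity.
samples = ["Sample 1", "Sample 2"]
times_min = [1.0, 1.2]
masses_mz = [245, 345]
cube = [
    [[10.0, 0.0], [3.5, 7.1]],
    [[12.0, 0.4], [2.9, 6.8]],
]

def unfold(cube, times_min, masses_mz):
    """Flatten each sample's time x mass slab into one row of paired variables."""
    header = [f"{t}min:{m} m/z" for t in times_min for m in masses_mz]
    rows = [
        [slab[i][j] for i in range(len(times_min)) for j in range(len(masses_mz))]
        for slab in cube
    ]
    return header, rows

# One row per sample, one column per time:mass pair.
header, rows = unfold(cube, times_min, masses_mz)
```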
[Figure: data cube with axes Sample, Time and m/z, unfolded into a table with one row per sample.]<br />
1.0min:245 m/z 1.0min:345 m/z 1.2min:567 m/z 3.4min:678 m/z<br />
Sample 1<br />
Sample 2<br />
Sample 3<br />
6. Chromatography & Electrophoresis data<br />
HPLC, GC and CE data is best imported as peak tables, using the chromatographic instrument software to handle small shifts in retention time by means of retention time windowing. Even slight shifts in retention time will severely hinder chemometric analysis if the raw data is used.<br />
Peak1 Peak2 Peak3 Peak4<br />
Sample1 1.0% 33.6% 25.5% 2.7%<br />
Sample2 2.3% 23.1% 34.5% 10.8%<br />
Sample3 5.4% 26.8% 41.6% 5.4%<br />
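The idea behind retention time windowing can be sketched in Python as follows. The windows, the detected peaks and the function name `peak_table` are hypothetical, and expressing each peak as a percentage of the summed windowed area is an assumption made for the example; in practice the instrument software builds the peak table.<br />

```python
# Hypothetical retention time windows in minutes: name -> (start, end).
windows = {
    "Peak1": (0.9, 1.1),
    "Peak2": (2.4, 2.6),
    "Peak3": (3.9, 4.1),
}

# Hypothetical detected peaks per sample: (retention time, area).
detected = {
    "Sample1": [(1.02, 50.0), (2.48, 150.0), (4.01, 300.0)],
    "Sample2": [(0.97, 100.0), (2.53, 100.0)],
}

def peak_table(detected, windows):
    """Sum peak areas per window, then express each as percent of the sample total."""
    table = {}
    for sample, peaks in detected.items():
        areas = {name: 0.0 for name in windows}
        for rt, area in peaks:
            for name, (start, end) in windows.items():
                if start <= rt <= end:
                    areas[name] += area
                    break  # a peak belongs to at most one window
        total = sum(areas.values())
        table[sample] = {
            name: (100.0 * a / total if total else 0.0) for name, a in areas.items()
        }
    return table

table = peak_table(detected, windows)
```

Because peaks are matched by window rather than by exact retention time, the small shifts between Sample1 and Sample2 still land in the same columns.<br />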
2-D Electrophoresis Gels and Proteomics data is best analysed as peak tables constructed by dedicated image analysis software before import into SIMCA-P.<br />
APPENDIX<br />
Detailed batch data description<br />
1. Compile initial conditions, batch evolution and final quality data.<br />
2. Consider every single source of variation:<br />
• What went into the batch?<br />
• What has been done to the batch?<br />
This includes SOP logs, recipes, provenance of raw materials, process historian data, process logs, metadata (operator, equipment, reactor, site) etc.<br />
3. Compile the data in a format similar to the table below:<br />