24.02.2013 Views

How to format your data for importing into SIMCA-P - Umetrics


How to format your data<br />

for importing into SIMCA-P<br />

Headquarters:<br />

Umetrics AB<br />

Box 7960<br />

SE-90719 Umeå<br />

Sweden<br />

Phone: +46 (0)90 184800<br />

Fax: +46 (0)90 184899<br />

Email: info.se@umetrics.com<br />

Contents<br />

General points<br />

1. QSAR<br />

2. Genomics / Proteomics / Metabonomics<br />

3. Spectroscopic data<br />

4. Process data (or biological time courses)<br />

5. LC-MS or GC-MS data<br />

6. Chromatography & Electrophoresis data<br />

Appendix: Detailed batch data description<br />

GUIDING THE WAY – USING SIMCA-P<br />

www.umetrics.com<br />

European Sales Offices:<br />

Umetrics AB<br />

Stortorget 21<br />

SE-211 34 Malmö<br />

Sweden<br />

Phone: +46 (0)40 6642580<br />

Fax: +46 (0)40 6642585<br />

Email: info.se@umetrics.com<br />

Umetrics UK Ltd.<br />

Woodside House,<br />

Winkfield, Windsor<br />

Berkshire, SL4 2DX, UK<br />

Phone: +44 (0)1344 885605<br />

Fax: +44 (0)1344 885410<br />

Email: info.uk@umetrics.com<br />

The Standard in Multivariate Data Analysis<br />

SIMCA-P and SIMCA-P+<br />

North America:<br />

Umetrics Inc.<br />

17 Kiel Ave.<br />

Kinnelon NJ 07405<br />

USA<br />

Phone: +1 973 492 8355<br />

Fax: +1 973 492 8359<br />

Email: info.us@umetrics.com<br />

IID 2029 2005


General points<br />

• Data can be compiled in Excel or as comma separated values (.csv)<br />

• Data may be held in Access database files<br />

• If your data has replicates, it is usually best to include these rather than averaging them.<br />

• Process data should be collected at the same time points.<br />

• Small amounts of missing data can be tolerated.<br />
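As an illustration of the points above, here is a minimal Python sketch that writes a .csv file with every replicate kept as its own row instead of an average. The sample names and values are hypothetical, and it assumes that an empty cell is an acceptable way to mark a missing value on import; check this against your SIMCA-P version.

```python
import csv
import io

# Hypothetical measurements: three replicates of sample A, two of sample B.
# None marks a missing value.
rows = [
    ("A_rep1", 1.02, 5.1),
    ("A_rep2", 0.98, None),
    ("A_rep3", 1.05, 5.3),
    ("B_rep1", 2.10, 7.8),
    ("B_rep2", 2.04, 7.6),
]


def to_csv(rows, header):
    """Write one line per replicate; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow(["" if value is None else value for value in row])
    return buffer.getvalue()


csv_text = to_csv(rows, ["Sample", "Var1", "Var2"])
print(csv_text)
```

To produce a file rather than a string, the same writer can be pointed at `open("data.csv", "w", newline="")`.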

1. QSAR<br />

[Table layout: one row per compound; columns Var1 to Var4 hold molecular or physical properties and columns Act 1 to Act 3 hold activity data.]<br />

Var1 Var2 Var3 Var4 Act 1 Act 2 Act 3<br />

Compound 1<br />

Compound 2<br />

Compound 3<br />

2. Genomics / Proteomics / Metabonomics<br />

If you have more than 256 gene / protein / NMR columns and you are using Excel, you will initially have to format the table with the gene data arranged vertically. The data can then be transposed during SIMCA-P import. Otherwise, use comma separated value files (.csv).<br />

[Figure: two example layouts. In both, the columns are Gene / Protein / Peak variables holding fluorescence data, together with a Dose column. In the first layout the rows are Dose 1, Dose 2, Dose 3, ... Dose n. In the second layout the rows are drug and dose combinations: Drug 1 Dose 1, Drug 1 Dose 2, Drug 2 Dose 1, Drug 2 Dose 2, ... Dose n.]<br />
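SIMCA-P performs the transposition itself during import, so nothing below is required. The following Python sketch only illustrates what the vertical arrangement and its transpose look like; the gene names and values are hypothetical.

```python
# Hypothetical table stored "vertically": one row per gene, one column per
# observation, so the number of columns stays small.
vertical = [
    ["Gene", "Dose 1", "Dose 2", "Dose 3"],
    ["gene_a", 0.12, 0.45, 0.34],
    ["gene_b", 0.45, 0.12, 0.23],
]


def transpose(table):
    """Swap rows and columns of a rectangular table (a list of lists)."""
    if len({len(row) for row in table}) > 1:
        raise ValueError("all rows must have the same number of cells")
    return [list(column) for column in zip(*table)]


# After transposition each row is an observation and each column a gene.
horizontal = transpose(vertical)
for row in horizontal:
    print(row)
```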

3. Spectroscopic <strong>data</strong><br />

Most spectroscopic instruments can save data in one of the following formats that SIMCA-P supports: JCAMP-DX, MVACDF, Brimrose files, Galactic SPC files, NSAS files. Saving the data in one of these formats is your first choice.<br />

Spectral data with fewer than 256 variables (wavelengths) can also be saved in Excel as below.<br />

[Figure: one row per spectrum (Spectra 1, Spectra 2, ... Spectra n); the columns hold the spectroscopy data, with a Y-var column alongside.]<br />

For spectra with more than 256 variables (wavelengths), arrange each spectrum down a column to work around the Excel limit of 256 columns. The data is then transposed during SIMCA-P import.<br />


Example: NIR spectra of raw material:<br />

Spectrum Batch 1 Spectrum Batch 2 Spectrum Batch n<br />

Wavelength Absorbance Absorbance Absorbance<br />

210 0.12 0.45 0.34<br />

211 0.45 0.12 0.23<br />

212….etc 0.23 0.23 0.12<br />

Example: Comparing spectra of different compounds:<br />

Compound 1 Compound 2 Compound n<br />

Wavelength Absorbance Absorbance Absorbance<br />

210 0.12 0.45 0.34<br />

211 0.45 0.12 0.23<br />

212….etc 0.23 0.23 0.12<br />

Property or activity 1<br />

Property or activity 2<br />

4. Process data (or biological time courses)<br />

Monitoring a process:<br />

[Figure: one row per time point (Time 1, Time 2, Time 3); one column per variable, each headed by a tag description (e.g. pressure pos2) and a tag name (e.g. FFC2-p002-xx01).]<br />

Equal intervals between time samples are preferable for a continuous process. Each measured variable is best identified by two or more cells at the top of its column: one for an interpretable name and one for the technical tag name in the database.<br />
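The two recommendations above can be sketched in Python as follows. This is an illustration under stated assumptions, not a SIMCA-P requirement: the helper names are made up, the temperature tag `FFC2-t002-xx01` and all values are hypothetical, and only `pressure pos2` and `FFC2-p002-xx01` come from the example in the text.

```python
import csv
import io


def is_evenly_spaced(times, tolerance=1e-9):
    """Return True if consecutive time stamps are a constant step apart."""
    if len(times) < 3:
        return True
    step = times[1] - times[0]
    return all(abs((b - a) - step) <= tolerance for a, b in zip(times, times[1:]))


def to_csv(descriptions, tag_names, samples):
    """Write two header rows (readable name, technical tag), then the data."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(descriptions)
    writer.writerow(tag_names)
    writer.writerows(samples)
    return buffer.getvalue()


# Hypothetical process log: time in minutes, then two measured variables.
descriptions = ["Time (min)", "pressure pos2", "temperature pos2"]
tag_names = ["", "FFC2-p002-xx01", "FFC2-t002-xx01"]
samples = [(0, 1.01, 80.2), (5, 1.03, 80.9), (10, 1.02, 81.4)]

times = [sample[0] for sample in samples]
print("evenly spaced:", is_evenly_spaced(times))
csv_text = to_csv(descriptions, tag_names, samples)
print(csv_text)
```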

Modelling a process output:<br />

[Figure: one row per time point (Time 1, Time 2, ... Time n); the columns hold the input measurements followed by the quality or output measurements.]<br />

For batch data (for metabonomics data, each batch is an individual):<br />

[Figure: rows grouped by batch (Batch 1, Batch 2), one row per time point within each batch; a Time column followed by the variable columns.]<br />

See the Appendix for a detailed description of how to format batch data.<br />
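A minimal Python sketch of one way to stack batches into a single table is given here. The column names `Batch`, `Time`, `Var1` and `Var2` and all values are hypothetical, and the stacked layout is an assumption read off the batch layout described above, not a reproduction of the Appendix table.

```python
# Hypothetical batch records: batch name -> list of (time, var1, var2).
# Batches may have different numbers of time points.
batches = {
    "Batch 1": [(0, 20.1, 1.00), (1, 25.3, 1.02), (2, 30.2, 1.05)],
    "Batch 2": [(0, 19.8, 0.99), (1, 24.9, 1.01)],
}


def stack_batches(batches):
    """Stack all batches under one header, one row per batch and time point."""
    rows = [["Batch", "Time", "Var1", "Var2"]]
    for name, observations in batches.items():
        # Sorting the tuples orders the observations by time within a batch.
        for observation in sorted(observations):
            rows.append([name, *observation])
    return rows


table = stack_batches(batches)
for row in table:
    print(row)
```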



5. LC-MS or GC-MS data<br />

LC-MS or GC-MS data takes the form of a three-dimensional data cube. To format it, unfold the cube so that each variable becomes a time:mass pair. Software such as MarkerLynx from Waters and MetAlign from Plant Research International can be used for this.<br />

[Figure: the data cube, with axes Sample, Time and m/z, unfolded into a table with one row per sample and one column per time:mass pair.]<br />

1.0min:245 m/z 1.0min:345 m/z 1.2min:567 m/z 3.4min:678 m/z<br />

Sample 1<br />

Sample 2<br />

Sample 3<br />
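The unfolding step can be sketched in Python as below. This is only an illustration on a tiny hypothetical cube: dedicated software also detects and aligns peaks, which this sketch does not attempt. The function name `unfold` and all intensities are assumptions.

```python
# Hypothetical cube: cube[sample][time_index][mz_index] = intensity.
retention_times = [1.0, 1.2]  # minutes
mz_values = [245, 345]
cube = {
    "Sample 1": [[10.0, 0.0], [3.5, 7.2]],
    "Sample 2": [[12.1, 0.4], [2.9, 6.8]],
}


def unfold(cube, retention_times, mz_values):
    """Flatten each sample's time x m/z matrix into one row of paired variables."""
    header = ["Sample"] + [
        f"{t}min:{mz} m/z" for t in retention_times for mz in mz_values
    ]
    rows = [header]
    for sample, matrix in cube.items():
        # Same order as the header: all m/z values for the first time, then the next.
        rows.append([sample] + [value for time_slice in matrix for value in time_slice])
    return rows


table = unfold(cube, retention_times, mz_values)
for row in table:
    print(row)
```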

6. Chromatography & Electrophoresis data<br />

HPLC, GC and CE data is best imported as peak tables, using the chromatographic instrument software to handle small shifts in retention time by means of retention time windowing. Even slight shifts in retention time will severely hinder chemometric analysis if the raw data is used.<br />

Peak1 Peak2 Peak3 Peak4<br />

Sample1 1.0% 33.6% 25.5% 2.7%<br />

Sample2 2.3% 23.1% 34.5% 10.8%<br />

Sample3 5.4% 26.8% 41.6% 5.4%<br />
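Retention time windowing is normally done by the instrument software. The Python sketch below only illustrates the idea: each detected peak is assigned to the named window that contains its retention time, so that small shifts still land in the same column. The window limits, retention times and function name are hypothetical.

```python
# Hypothetical windows: peak name -> (start, end) retention time in minutes.
windows = {"Peak1": (1.0, 1.4), "Peak2": (2.0, 2.5), "Peak3": (3.1, 3.6)}

# Hypothetical detected peaks per sample: (retention time, area in percent).
detected = {
    "Sample1": [(1.21, 1.0), (2.24, 33.6), (3.30, 25.5)],
    "Sample2": [(1.18, 2.3), (2.31, 23.1)],
}


def build_peak_table(detected, windows):
    """Return a table with one row per sample and one column per window."""
    names = list(windows)
    table = [["Sample"] + names]
    for sample, peaks in detected.items():
        row = dict.fromkeys(names)  # None means no peak fell in that window
        for retention_time, area in peaks:
            for name, (start, end) in windows.items():
                if start <= retention_time <= end:
                    # Peaks sharing a window are summed into one value.
                    row[name] = (row[name] or 0.0) + area
                    break
        table.append([sample] + [row[name] for name in names])
    return table


peak_table = build_peak_table(detected, windows)
for row in peak_table:
    print(row)
```

A window with no peak is left as None here, which corresponds to a missing value in the peak table.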

2-D electrophoresis gel and proteomics data is best analysed as peak tables constructed by dedicated image analysis software before import into SIMCA-P.<br />



APPENDIX<br />

Detailed batch data description<br />

1. Compile initial conditions, batch evolution and final quality data.<br />

2. Consider every single source of variation:<br />

• What went into the batch?<br />

• What has been done to the batch?<br />

This includes SOP logs, recipes, provenance of raw materials, process historian data, process logs, metadata (operator, equipment, reactor, site) etc.<br />

3. Compile the data in a format similar to the table below:<br />
