20.01.2014 Views

on outsourcing and data management - icrisat


Outsourcing and Data Handling Issues


What do you want to do and who is it for?


Outsourcing

• We don’t own a machine and we can’t yet see a good reason to get one.
• The technology is changing rapidly and we’re not big enough to support multiple NGS technologies. If we bought a machine it would probably be the wrong one.
• Until recently our throughput would not have filled the capacity of a machine.
• We have a wide spectrum of NGS needs which map onto different library types and sequencing technologies.
• The problem for us then becomes who do we go to, and what decision processes do we use to help us make the choice?


Reasons to choose an NGS provider

• Price
  • But be sensible about how you judge price.
  • Price should be judged on a full economic cost model.
• NGS sequencing can be thought of as a multi-component package, comprising at the very least:
  • Choice of biological material
  • Library construction
  • Sequencing
  • Base calling
  • Informatics
• We need to start by developing a clear view of what we are sequencing and why. This should take account of not only the immediate aim of the work but also the potential future value of the data.


What are we sequencing and why?

• Is this a genome or transcriptome sequencing project? The choice is not always a black and white one.
• Have we thought clearly about how many and what genotypes and/or tissues we are going to sequence?
• Should we pool genotypes or tissues?
• Is the choice consistent with our primary aims? For example, have we chosen genotypes to minimise ascertainment bias, or are the developmental stages or tissues we want sufficiently represented?
• Have we thought about other uses for the data?
• Have we talked it through with someone who actually knows the time of day?


Outsourcing

• Let’s suppose we have done the groundwork. We know what we want to do and why.
• So where do we go? It’s not rocket science!
• The first step is to talk through our requirements with several potential providers and/or try to identify and talk to some groups who have already done the sort of thing we have in mind.
• Be aware that, with few exceptions, we are ALL amateurs in this game. So don’t trust the hype from commercial or academic providers.
• Look for experience in preparing the library types you want to sequence and, ideally, in working with plant tissue sources.
• Who picks up the tab if a library construction or sequence run fails?


Once you have made the choice

• Make sure you know what you are getting and when. We have had several experiences of people generating our data in little bits and squirting it back to us in sporadic lumps.
• A professional provider will know and understand their service and should give you good stats / metadata on a run. Ask them to provide you with some examples of this.
• Sequence a little first and then check what you have, e.g. start with a lane and then check for things like redundancy, length distribution, chloroplast contamination etc., depending on your project.
• This QC checking needs to start early in the process. It’s no use finding it’s all crap when all the money is spent.
• It took us months to begin to learn what the issues are. We are still learning.
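
As a rough illustration of the "sequence a little first and then check" advice, the sketch below tallies read lengths and the fraction of exactly duplicated reads in a FASTQ file. It is a minimal sketch rather than a recommended pipeline: the function names `read_fastq` and `quick_qc` are hypothetical, it assumes plain four-line FASTQ records, and it holds every distinct sequence in memory, so for a full lane you would probably want to run it on a subsample. Checks such as chloroplast contamination need alignment against a reference and are not covered here.

```python
import gzip
from collections import Counter


def read_fastq(path):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:  # end of file
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header, seq, qual


def quick_qc(records):
    """Summarise read count, length distribution and exact-duplicate fraction."""
    lengths = Counter()
    seen = Counter()
    total = 0
    for _, seq, _ in records:
        total += 1
        lengths[len(seq)] += 1
        seen[seq] += 1
    # Reads beyond the first copy of each distinct sequence count as duplicates.
    duplicates = total - len(seen)
    return {
        "reads": total,
        "length_distribution": dict(sorted(lengths.items())),
        "duplicate_fraction": duplicates / total if total else 0.0,
    }
```

Typical use would be something like `quick_qc(read_fastq("lane1.fastq.gz"))`, where the file name is a placeholder for whatever the provider delivers.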


Getting data

• You’re responsible for your own data; you can’t expect the sequencing provider to keep a copy. We lodge everything in a Subversion repository when it comes in. Check your organisation’s backup strategy.
• While you are at it, who has the library you were sequencing? If the data is good, can you go back to get more sequence from the library?
• Naming things is important.
• There is a lot of data, but not a stupid amount. Be careful, though, to take account of the fact that you might need lots of analysis runs on the same data and you may want to hang on to them for a bit. They can, however, be offline. USB disks are cheap! Sending them is also a good way to network big data sets.
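
If data sets travel on USB disks and sit offline for a while, one simple safeguard is to record a checksum for every file when it arrives and to re-check it after each copy. The slides do not describe this step, so treat the following as an assumed addition: a minimal sketch using SHA-256, with hypothetical function names `build_manifest` and `changed_files`.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its checksum."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir, manifest):
    """List files from the manifest that are missing or whose checksum differs."""
    current = build_manifest(data_dir)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

The manifest built on arrival can be stored alongside the data in version control and compared against the copy on the disk once it reaches its destination.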


Processing and analysis

• We discussed many of the issues but realise that the whole thing is moving on. The technologies are changing and the analysis and visualization methods are evolving in response.
• The published literature is often out of date.
• There are a number of user groups, either physical or online. Make use of them, but also be sceptical. Sometimes those who make most noise are not the ones who actually know the time of day.
• This might include me!
• Consider setting up your own user group if this helps. You can do a lot even through something like Webex.


Data and analysis

• Lots of formats for data - sequence and quality.
• NGS quality scores are not always what they seem. They are NOT Phred quality scores and they may not be equivalent to one another, e.g. for Solexa data there are several different ways / pipelines for base calling and the corresponding quality scores are not simple equivalents.
• You need metadata as well as the FASTA/FASTQ files; otherwise a SNP calling approach tuned to one quality type may be applied to data of another type for which it is not optimised.
• You may have lots of data files as well as lots of output files from analysis. Numbers may be harder to manage than volume.
• For many analyses, versioning is important.
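
One well-documented case of quality scores not being simple equivalents is the difference between the Sanger-style FASTQ encoding (Phred scores, ASCII offset 33), the early Solexa encoding (Solexa-scaled scores, ASCII offset 64) and the later Illumina 1.3+ encoding (Phred scores, ASCII offset 64). The sketch below decodes a quality string under each convention and converts Solexa-scaled scores to Phred using the standard relation Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The table and function names are hypothetical, and which encoding a given file uses is exactly the metadata you need from the provider.

```python
import math

# (ASCII offset, score scale) for three common FASTQ quality encodings.
ENCODINGS = {
    "sanger": (33, "phred"),
    "solexa": (64, "solexa"),
    "illumina-1.3": (64, "phred"),
}


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality score to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_quality(qual_string, encoding):
    """Decode a FASTQ quality string to integer Phred scores for a named encoding."""
    offset, scale = ENCODINGS[encoding]
    scores = [ord(ch) - offset for ch in qual_string]
    if scale == "solexa":
        scores = [round(solexa_to_phred(q)) for q in scores]
    return scores
```

The same character means different things under different encodings: `decode_quality("I", "sanger")` gives a score of 40, while `decode_quality("I", "illumina-1.3")` gives 9, which is why a file without its encoding recorded is easy to misinterpret.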


NGS Meetings
