on outsourcing and data management - icrisat
Outsourcing and Data handling Issues
What do you want to do and who is it for?
Outsourcing
• We don’t own a machine and we can’t yet see a good reason to get one.
• The technology is changing rapidly and we’re not big enough to support multiple NGS technologies. If we bought a machine it would probably be the wrong one.
• Until recently our throughput would not have filled the capacity of a machine.
• We have a wide spectrum of NGS needs which map onto different library types and sequencing technologies.
• The problem for us then becomes who do we go to and what decision processes do we use to help us make the choice.
Reasons to choose an NGS provider
• Price
• But be sensible as to how to judge price.
• Price should be judged on a full economic cost model.
• NGS sequencing can be thought of as a multi-component package, comprising at the very least:
  • Choice of biological material
  • Library construction
  • Sequencing
  • Base calling
  • Informatics
• We need to start by developing a clear view of what we are sequencing and why. This should take account of not only the immediate aim of the work but also the potential future value of the data.
What are we sequencing and why?
• Is this a genome or transcriptome sequencing project? The choice is not always a black and white one.
• Have we thought clearly about how many and what genotypes and/or tissues we are going to sequence?
• Should we pool genotypes or tissues?
• Is the choice consistent with our primary aims? i.e. have we chosen genotypes to minimise ascertainment bias, or are the developmental stages or tissues we want sufficiently represented?
• Have we thought about other uses for the data?
• Have we talked it through with someone who actually knows the time of day?
Outsourcing
• Let’s suppose we have done the groundwork. We know what we want to do and why.
• So where do we go? It’s not rocket science!
• The first step is to talk through our requirements with several potential providers and/or try to identify and talk to some groups who have already done the sort of thing we have in mind.
• Be aware that, with few exceptions, we are ALL amateurs in this game. So don’t trust the hype from commercial or academic providers.
• Look for experience in preparing the library types you want to sequence, ideally from plant tissue sources.
• Who picks up the tab if a library construction or sequencing run fails?
Once you have made the choice
• Make sure you know what you are getting and when. We have had several experiences of people generating our data in little bits and squirting it back to us in sporadic lumps.
• A professional provider will know and understand their service and should give you good stats / metadata on a run. Ask them to provide you with some examples of this.
• Sequence a little first and then check what you have, e.g. start with a lane and then check for things like redundancy, length distribution, chloroplast contamination, etc., depending on your project.
• This QC checking needs to start early in the process. It’s no use finding it’s all crap when all the money is spent.
• It took us months to begin to learn what the issues are. We are still learning.
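The early QC check on a first lane could be sketched roughly as follows. This is a minimal illustration, not the authors’ pipeline: it assumes plain four-line FASTQ records with no blank lines, treats exact duplicate sequences as a crude proxy for redundancy, and uses made-up reads. The function names `read_fastq` and `summarise` are hypothetical. Checking chloroplast contamination would need an alignment against a chloroplast reference and is not covered here.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def summarise(handle):
    """Return read count, length distribution and fraction of duplicate reads."""
    lengths = Counter()
    seen = Counter()
    for _name, seq, _qual in read_fastq(handle):
        lengths[len(seq)] += 1
        seen[seq] += 1
    total = sum(seen.values())
    duplicates = total - len(seen)  # reads beyond the first copy of each sequence
    return {
        "reads": total,
        "length_distribution": dict(lengths),
        "duplicate_fraction": duplicates / total if total else 0.0,
    }


# Made-up example: three reads, two of them identical.
example = StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACGTA\n+\nIIIII\n"
)
stats = summarise(example)
print(stats)
```

On a real lane the same summary would be run on the delivered file before any more sequencing is commissioned.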
Getting data
• You’re responsible for your own data; you can’t expect the sequencing provider to keep a copy. We lodge everything in a Subversion repository when it comes in. Check your organisation’s backup strategy.
• While you are at it, who has the library you were sequencing? If the data is good, can you go back to get more sequence from the library?
• Naming things is important.
• There is a lot of data, but not a stupid amount. Be careful, though, to take account of the fact that you might need lots of analysis runs on the same data and you may want to hang on to them for a bit. They can, however, be offline. USB disks are cheap! Sending them is also a good way to move big data sets around.
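One way to keep track of data that arrives on disks and sits in backups is a checksum manifest. The sketch below is an illustration only: the slides do not describe this practice, the choice of SHA-256 is an assumption, and the function and file names (`build_manifest`, `changed_files`, `run01_lane1.fastq`) are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its checksum."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(directory, manifest):
    """List manifest entries that are missing or differ in directory."""
    current = build_manifest(directory)
    return sorted(name for name in manifest if current.get(name) != manifest[name])


# Demonstration with a throwaway directory and a made-up file name.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "run01_lane1.fastq"
    run.write_text("@r1\nACGT\n+\nIIII\n")
    manifest = build_manifest(tmp)
    unchanged = changed_files(tmp, manifest)   # nothing altered yet
    run.write_text("@r1\nACGA\n+\nIIII\n")     # simulate a corrupted copy
    damaged = changed_files(tmp, manifest)
print(unchanged, damaged)
```

The manifest can be written when the data first comes in and checked again after a disk has been shipped or restored from backup.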
Processing and analysis
• We discussed many of the issues but realise that the whole thing is moving on. The technologies are changing and the analysis and visualization methods are evolving in response.
• The published literature is often out of date.
• There are a number of user groups, either physical or online. Make use of them but also be sceptical. Sometimes those who make most noise are not the ones who actually know the time of day.
• This might include me!
• Consider setting up your own user group if this helps. You can do a lot even through something like Webex.
Data and analysis
• Lots of formats for data - sequence and quality.
• NGS quality scores are not always what they seem. They are NOT necessarily Phred quality scores and they may not be equivalent to one another, e.g. for Solexa data there are several different pipelines for base calling, and the corresponding quality scores are not simple equivalents.
• You need metadata as well as FASTA/FASTQ files; otherwise a SNP calling approach tuned to one quality type may not be optimised for another.
• You may have lots of data files as well as lots of output files from analysis. Numbers may be harder to manage than volume.
• For many analyses, versioning is important.
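To illustrate why the quality type matters, here is a minimal sketch of the difference between the original Solexa score and a Phred score. It relies on the standard definitions, Phred Q = -10 log10(p) and Solexa Q = -10 log10(p / (1 - p)), and on the commonly documented ASCII offsets (33 for Sanger-style FASTQ, 64 for Solexa and early Illumina FASTQ). Which encoding a given file actually uses is exactly the metadata you need from the provider; the function names are hypothetical.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa (odds-based) quality score to a Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_error(q_phred):
    """Convert a Phred score to the estimated base-call error probability."""
    return 10 ** (-q_phred / 10)


def decode_quality(qual_string, offset):
    """Decode a FASTQ quality string to integer scores for a given ASCII offset."""
    return [ord(c) - offset for c in qual_string]


# The two scales diverge at low quality and converge at high quality.
for q in (-5, 0, 10, 40):
    print(q, round(solexa_to_phred(q), 2))

# The same character means different things under different offsets.
print(decode_quality("I", 33), decode_quality("I", 64))
```

A SNP caller that assumes the wrong scale or offset will read every score in the file incorrectly, which is why the pipeline and encoding should be recorded alongside the sequence files.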
NGS Meetings