
on outsourcing and data management - icrisat


from icrisat.org

20.01.2014

Outsourcing and Data handling Issues


What do you want to do and who is it for?

Outsourcing

• We don’t own a machine and we can’t yet see a good reason to get one.

• The technology is changing rapidly and we’re not big enough to support multiple NGS technologies. If we bought a machine it would probably be the wrong one.

• Until recently our throughput would not have filled the capacity of a machine.

• We have a wide spectrum of NGS needs which map onto different library types and sequencing technologies.

• The problem for us then becomes who we go to and what decision process we use to help us make the choice.

Reasons to choose an NGS provider

• Price

• But be sensible as to how to judge price

• Price should be judged on a full economic cost model

• NGS can be thought of as a multi-component package, comprising at the very least:

• Choice of biological material

• Library construction

• Sequencing

• Base calling

• Informatics

• We need to start from the point of developing a clear view of what we are sequencing and why. This should take account of not only the immediate aim of the work but also the potential future value of the data.
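The idea of judging price on a full economic cost model can be made concrete with a small sketch. Everything here is an assumption for illustration: the `Quote` class, the two providers and every figure are hypothetical and in arbitrary currency units. The point is only that the quote with the lower sequencing price is not necessarily the cheaper one once every component of the package, and the expected yield of usable data, is counted.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    """One provider's quote, split into the components listed above.

    All fields are hypothetical illustration values, not real prices.
    """
    provider: str
    material_preparation: float   # choosing / preparing biological material
    library_construction: float
    sequencing: float
    base_calling: float
    informatics: float            # includes in-house staff time if not provided
    expected_usable_gb: float     # gigabases we expect to keep after QC

    def full_cost(self) -> float:
        """Sum every component, not just the headline sequencing price."""
        return (
            self.material_preparation
            + self.library_construction
            + self.sequencing
            + self.base_calling
            + self.informatics
        )

    def cost_per_usable_gb(self) -> float:
        """Full cost divided by the data we actually expect to be able to use."""
        return self.full_cost() / self.expected_usable_gb


# Hypothetical quotes: A has the cheaper sequencing run but leaves the
# informatics to us; B costs more up front but includes the analysis.
quotes = [
    Quote("Provider A", 200, 400, 1500, 0, 1200, 20),
    Quote("Provider B", 200, 600, 2000, 0, 0, 25),
]

for q in quotes:
    print(q.provider, q.full_cost(), round(q.cost_per_usable_gb(), 1))

best = min(quotes, key=Quote.cost_per_usable_gb)
print("Lowest cost per usable Gb:", best.provider)
```

A fuller model might also allow for the cost of failed libraries or runs, depending on who carries that risk.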

What are we sequencing and why ?

• Is this a genome or transcriptome sequencing project? The choice is not always a black and white one.

• Have we thought clearly about how many and what genotypes and/or tissues we are going to sequence?

• Should we pool genotypes or tissues?

• Is the choice consistent with our primary aims? i.e. have we chosen genotypes to minimise ascertainment bias, or are the developmental stages or tissues we want sufficiently represented?

• Have we thought about other uses for the data?

• Have we talked it through with someone who actually knows the time of day?

Outsourcing

• Let’s suppose we have done the groundwork. We know what we want to do and why.

• So where do we go? It’s not rocket science!

• The first step is to talk through our requirements with several potential providers and/or try to identify and talk to some groups who have already done the sort of thing we have in mind.

• Be aware that we are ALL amateurs in this game, with few exceptions. So don’t trust the hype from commercial or academic providers.

• Look for experience in preparing the library types you want to sequence, ideally from plant tissue sources.

• Who picks up the tab if a library construction or sequence run fails?

Once you have made the choice

• Make sure you know what you are getting and when. We have had several experiences of people generating our data in little bits and squirting it back to us in sporadic lumps.

• A professional provider will know and understand their service and should give you good stats / metadata on a run. Ask them to provide you with some examples of this.

• Sequence a little first and then check what you have, e.g. start with a lane and then check for things like redundancy, length distribution, chloroplast contamination etc., depending on your project.

• This QC checking needs to start early in the process. It’s no use finding it’s all crap when all the money is spent.

• It took us months to begin to learn what the issues are. We are still learning.
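As a rough illustration of the early QC checks mentioned above, the following minimal sketch counts reads, measures redundancy (the fraction of reads that are exact duplicates of an earlier read) and tabulates the read length distribution for a FASTQ stream. It assumes plain four-line FASTQ records and holds every distinct sequence in memory, so it is only suitable for a subsample or a small first batch. The function names `read_fastq` and `quick_qc` are made up for this example, and contamination checks such as chloroplast content would need an alignment step that is not shown.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if header == "":          # end of file
            return
        if not header.strip():    # tolerate stray blank lines between records
            continue
        seq = handle.readline().strip()
        handle.readline()         # the '+' separator line
        qual = handle.readline().strip()
        yield header.strip()[1:], seq, qual


def quick_qc(handle):
    """Return read count, redundancy and length distribution for a FASTQ stream."""
    lengths = Counter()
    copies = Counter()
    for _name, seq, _qual in read_fastq(handle):
        lengths[len(seq)] += 1
        copies[seq] += 1
    total = sum(copies.values())
    redundant = total - len(copies)   # reads that duplicate an earlier read
    return {
        "reads": total,
        "unique_sequences": len(copies),
        "redundant_fraction": redundant / total if total else 0.0,
        "length_distribution": dict(sorted(lengths.items())),
    }


# Tiny made-up example: three reads, two of them identical.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACGTAC\n+\nIIIIII\n"
)
report = quick_qc(example)
print(report)
```

For a real file, pass an open text handle, for example `quick_qc(open(path))`, where `path` points at whatever the provider delivered.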

Getting data

• You’re responsible for your own data; you can’t expect the sequencing provider to keep a copy. We lodge everything in a Subversion repository when it comes in. Check your organisation’s backup strategy.

• While you are at it, who has the library you were sequencing? If the data is good, can you go back to get more sequence from the library?

• Naming things is important.

• There is a lot of data, but not a stupid amount. Be careful, though, to take account of the fact that you might need lots of analysis runs on the same data and you may want to hang on to them for a bit. They can however be offline. USB disks are cheap! Sending them is also a good way to network big data sets.
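One way to take responsibility for data as it arrives, alongside lodging it in a repository, is to record a checksum for every file on receipt and compare again after any copy, including a trip on a USB disk. The sketch below is an assumption about how that might be done, not a description of the workflow used here; the helper names `sha256_of`, `build_manifest` and `changed_files` and the file name `lane1.fastq` are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large sequence files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its SHA-256 checksum."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old, new):
    """List files that were added, removed or altered between two manifests."""
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )


# Demonstration on a throwaway directory with a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "lane1.fastq"
    data_file.write_bytes(b"@r1\nACGT\n+\nIIII\n")
    on_arrival = build_manifest(tmp)

    data_file.write_bytes(b"@r1\nACGA\n+\nIIII\n")   # simulate corruption
    later = build_manifest(tmp)

print(changed_files(on_arrival, later))
```

Storing the manifest next to the data, under a consistent naming scheme, also makes it easier to tell which delivery a given file came from.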

Processing and analysis

• We discussed many of the issues but realise that the whole thing is moving on. The technologies are changing and the analysis and visualization methods are evolving in response.

• The published literature is often out of date.

• There are a number of user groups, either physical or online. Make use of them but also be sceptical. Sometimes those who make most noise are not the ones who actually know the time of day.

• This might include me!!!!!!

• Consider setting up your own user group if this helps. You can do a lot even through something like Webex.

• Lots of formats for data - sequence and quality.

Data and analysis

• NGS quality scores are not always what they seem. They are NOT necessarily Phred quality scores and they may not be equivalent to one another, e.g. for Solexa data there are several different ways / pipelines for base calling and the corresponding quality scores are not simple equivalents.

• You need metadata as well as FASTA/FASTQ files, otherwise a SNP calling approach tuned to one quality type may not be optimised for another.

• You may have lots of data files as well as lots of output files from analysis. Numbers may be harder to manage than volume.

• For many analyses versioning is important.
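To show why quality score types are not interchangeable, the sketch below compares the standard Phred definition, Q = -10 log10(p), with the log-odds definition used by early Solexa pipelines, Q = -10 log10(p / (1 - p)), where p is the estimated probability that the base call is wrong. These two formulas, and the ASCII offsets of 33 (Sanger-style FASTQ) and 64 (Solexa and early Illumina FASTQ), are general background rather than anything stated in this talk, and the function names are invented for the example. Which scale and offset a given file uses has to come from the provider's metadata.

```python
import math


def phred_to_error(q):
    """Error probability implied by a Phred score: Q = -10 log10(p)."""
    return 10 ** (-q / 10)


def solexa_to_error(q):
    """Error probability implied by a Solexa score: Q = -10 log10(p / (1 - p))."""
    return 1 / (1 + 10 ** (q / 10))


def solexa_to_phred(q):
    """Convert a Solexa log-odds score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q / 10) + 1)


def decode(quality_string, offset):
    """Turn a FASTQ quality string into integer scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in quality_string]


# The two scales agree closely at high quality and diverge at low quality.
for q in (-5, 0, 5, 10, 20, 40):
    print(q, round(solexa_to_phred(q), 2), round(solexa_to_error(q), 4))

# The same characters mean different scores under different offsets.
print(decode("IIII", 33))
print(decode("IIII", 64))
```

A SNP caller that assumes the wrong scale or offset will misjudge the reliability of low-quality bases, which is the practical reason for asking for this metadata along with the sequence files.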

NGS Meetings
