02.08.2013 Views

Sample A: Cover Page of Thesis, Project, or Dissertation Proposal


cleansing process and were assessed for their classification ability. The original report of the Stearman experiment listed 409 genes that demonstrated concordance in log2 difference (tumor minus normal) between the murine and human models [4]. Of this list, 178 ProbeSets survived the BaFL cleansing process. The concordances between the original three lists are relatively small, containing 58 and 34 ProbeSets, respectively, for the 0.8 (Bhattacharjee x Stearman) and 0.85 (Bhattacharjee x Stearman) lists, so the size of the concordant lists against BaFL is not unreasonable. We compared the classification abilities of these three lists to that of the 325 ProbeSets whose aggregate was classified as differentially expressed for both the BaFL cleansed Bhattacharjee and Stearman data. ProbeSet values were again generated by each of the three data cleansing pipelines as part of the comparison. Complete ProbeSet lists are given in the Supplementary Materials, in the Data folder.
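The list comparisons described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: it assumes that "concordance" means the log2 difference (tumor minus normal) has the same sign in both datasets, and the function names, ProbeSet IDs, and values are all hypothetical.

```python
from math import log2
from statistics import mean


def log2_difference(tumor, normal):
    """Mean log2 tumor intensity minus mean log2 normal intensity."""
    return mean(log2(v) for v in tumor) - mean(log2(v) for v in normal)


def concordant_probesets(diff_a, diff_b):
    """ProbeSets present in both inputs whose log2 differences share a sign.

    Assumption: concordance is taken to mean 'same direction of change'.
    """
    shared = diff_a.keys() & diff_b.keys()
    return {p for p in shared if diff_a[p] * diff_b[p] > 0}


# Hypothetical toy values, not data from the study.
human = {"ps1": 1.2, "ps2": -0.7, "ps3": 0.4}
mouse = {"ps1": 0.9, "ps2": 0.3, "ps4": -1.1}

concordant = concordant_probesets(human, mouse)

# Hypothetical set of ProbeSets that survived a cleansing step.
surviving = {"ps1", "ps3"}
kept = concordant & surviving
```

Set intersection keeps the comparison symmetric, so the order in which the two lists are supplied does not matter.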

Classification

The classification performance of supervised learning methods was assessed using the various candidate gene lists as training sets. The area under the receiver operating characteristic curve (AUC) was the performance metric for all the classification experiments [13, 14]. For each algorithm, the base model included all of the original surviving ProbeSets, either the 12,625 (for RMA and dCHIP) or the intersecting 4,200 (from BaFL). The classification experiments require that the same ProbeSets be present in both datasets. This only affects the BaFL data because RMA and dCHIP give complete gene value matrices [2, 18]. Therefore, for the BaFL ProbeSet lists, subsets that consisted of candidate gene list intersections were used. The values in the resulting ProbeSet lists were then used, in turn, to train three different classification algorithms: k-nearest neighbors (kNN) [7, 23, 35], linear discriminant analysis (LDA) [8, 9, 20] and random forest (RF) [10-12]. The R implementations of these algorithms were used [16]. The parameters
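As a rough illustration of the evaluation described in this section, the sketch below restricts two datasets to their shared ProbeSets, scores test samples with a simple kNN vote, and computes the AUC as the fraction of (tumor, normal) pairs ranked correctly. It is written in plain Python rather than the R implementations the study used, and every name, the value of k, and the toy data are assumptions made for the example.

```python
from math import dist


def common_probesets(train, test):
    """Sorted ProbeSet IDs present in both datasets (dict: ProbeSet -> values)."""
    return sorted(train.keys() & test.keys())


def knn_scores(train_x, train_y, test_x, k=3):
    """Score each test sample by the fraction of its k nearest
    training samples (Euclidean distance) that are labeled 1."""
    scores = []
    for x in test_x:
        order = sorted(range(len(train_x)), key=lambda i: dist(x, train_x[i]))
        scores.append(sum(train_y[i] for i in order[:k]) / k)
    return scores


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive sample scores higher
    than a random negative one; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two ProbeSet values per sample, 0 = normal, 1 = tumor.
train_x = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [(1.1, 1.0), (3.1, 3.0), (2.8, 3.2), (1.0, 1.2)]
test_y = [0, 1, 1, 0]

scores = knn_scores(train_x, train_y, test_x, k=3)
result = auc(test_y, scores)
```

The pairwise form of the AUC is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be the usual choice for larger datasets.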
