Sample A: Cover Page of Thesis, Project, or Dissertation Proposal
Chapter 4: Data Mining
Adenocarcinoma is a subtype of non-small cell lung cancer (NSCLC) and the most frequent type of
lung cancer found in the world today [1, 2]. Adenocarcinomas are peripherally located in the
lungs and develop from Clara cells, alveoli, and mucin-producing cells [1]. While tobacco
smoking has been well established as an initiating condition for lung cancer, with 80-90% of lung
cancer cases arising in tobacco smokers, adenocarcinoma in particular is most common among
women, non-smokers, and the young [1]. Given that the incidence rate of adenocarcinoma is
increasing and the disease is affecting non-traditional patients, understanding it is of immediate
concern [1, 2].

Using the methods described in the previous chapters, we have created two 2-class datasets, one
from a subset of the Bhattacharjee dataset and the other from the original Stearman (human
subset) experiments [3, 4]. In this chapter we begin with the down-selected ProbeSet list
presented in Chapter 3, which consists of the 325 differentially expressed ProbeSets common to
both datasets. The values that the BaFL pipeline yields for these ProbeSets lead to datasets with
considerable latent structure; we will demonstrate that this latent structure is superior to that of
RMA- and dCHIP-supplied values using two widely accepted dimensionality reduction methods:
Principal Components Analysis [5, 6], which is linear, and a Laplacian method, which is
non-linear [7]. In validating the results of these analyses, we use sample correlation to explore the
gene/ProbeSet clusters.
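
To make the two kinds of dimensionality reduction named above concrete, the following is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the analysis code used in this work: the data are synthetic (20 samples by 325 ProbeSets, with an artificial class shift), the function names pca_scores and laplacian_embedding are hypothetical, and the Laplacian variant shown (eigenmaps on an unweighted k-nearest-neighbour graph) is one common formulation that may differ from the method of [7].

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Linear reduction: project samples (rows) onto the top principal components."""
    Xc = X - X.mean(axis=0)                      # center each ProbeSet (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)        # fraction of variance per component
    return scores, explained[:n_components]

def laplacian_embedding(X, n_components=2, k=5):
    """Non-linear reduction: Laplacian eigenmaps on a symmetric kNN graph (assumed variant)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]        # k nearest samples, skipping self
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                       # symmetrize the adjacency matrix
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L_sym = np.eye(n) - d_inv_sqrt @ W @ d_inv_sqrt          # normalized graph Laplacian
    vals, vecs = np.linalg.eigh(L_sym)           # eigenvalues in ascending order
    # Drop the first (trivial) eigenvector; assumes the graph is connected.
    return d_inv_sqrt @ vecs[:, 1:n_components + 1]

# Synthetic stand-in for a 2-class expression matrix (NOT the thesis data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 325))
X[10:, :50] += 2.0                               # shift 50 ProbeSets in the second class

scores, explained = pca_scores(X)
emb = laplacian_embedding(X)
sample_corr = np.corrcoef(X)                     # sample-by-sample correlation matrix
```

In this sketch each row is a sample and each column a ProbeSet, so np.corrcoef(X) yields the sample-by-sample correlation matrix. The choice of k and the unweighted graph are illustrative settings only.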