Scalable Approaches for Analysis of Human Genome-Wide Expression and Genetic Variation Data

Gad Abraham

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

March 2012

Department of Computing and Information Systems
The University of Melbourne

Produced on archival quality paper


Abstract

One of the major tasks in bioinformatics and computational biology is prediction of phenotype from molecular data. Predicting phenotypes such as disease opens the way to better diagnostic tools, potentially identifying disease earlier than would be detectable using other methods. Examining molecular signatures rather than clinical phenotypes may help refine disease classification and prediction procedures, since many diseases are known to have multiple molecular subtypes with differing etiology, prognosis, and treatment options. Beyond prediction itself, identifying predictive markers aids our understanding of the biological mechanisms underlying phenotypes such as disease, generating hypotheses that can be tested in the lab.

The aims of this thesis are to develop effective and efficient computational and statistical tools for analysing large-scale gene expression and genetic datasets, with an emphasis on predictive models. Several key challenges include the high dimensionality of the data, which has important statistical and computational implications; noisy data due to measurement error and the stochasticity of the underlying biology; and maintaining biological interpretability without sacrificing predictive performance. We begin by examining the problem of predicting breast cancer metastasis and relapse from gene expression data, and present an alternative approach based on gene set statistics. Second, we address the problem of analysing large human case/control genetic (single nucleotide polymorphism) data, and present an efficient and scalable algorithm for fitting sparse models to large datasets. Third, we apply sparse models to genetic case/control datasets from eight complex human diseases, evaluating how well each one can be predicted from genotype. Fourth, we apply sparse lasso methods to a multi-omic dataset consisting of genetic variation, gene expression, and serum metabolites, for reconstruction of genetic regulatory networks. Finally, we propose a novel multi-task statistical approach, intended for modelling multiple correlated phenotypes.

In summary, this thesis discusses a range of predictive models and applies them to a wide range of problems, including gene expression, genetic, and multi-omic datasets. We demonstrate that such models, and particularly sparse models, are computationally feasible and can scale to large datasets, provide increased insight into the biological causes of disease, and for some diseases have high predictive performance, allowing high-confidence disease diagnosis to be made based on genetic data.


Declaration

This is to certify that

1. the thesis comprises only my original work towards the PhD except where indicated in the Preface,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Signed


Preface

This thesis incorporates the following publications:

• Chapter 4 is substantially based on: G. Abraham, A. Kowalczyk, S. Loi, I. Haviv, and J. Zobel. Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context. BMC Bioinformatics, 11:277, 2010. (primary author: G. Abraham).

• Chapter 5 is substantially based on: G. Abraham, A. Kowalczyk, J. Zobel, and M. Inouye. SparSNP: Fast and memory-efficient analysis of all SNPs for phenotype prediction. BMC Bioinformatics, 13:88, 2012. (primary author: G. Abraham).

• Chapter 6 is substantially based on: G. Abraham, A. Kowalczyk, J. Zobel, and M. Inouye. Sparse linear models to explain phenotypic variance and predict complex disease. In "NIPS Personalized Medicine Workshop 2011", December 16th, 2011, Granada, Spain. (primary author: G. Abraham).

• Chapter 6 is partly based on: G. Abraham, A. Kowalczyk, J. Zobel, and M. Inouye. Sparse linear models to explain phenotypic variance and predict complex disease (expanded journal version). 2012. Under peer review. (primary author: G. Abraham).


Acknowledgment

Thanks are due to my supervisors, Professor Justin Zobel and Dr Adam Kowalczyk. Justin taught me how to think about scientific questions, encouraged me to dig deeper when interpreting my own work and that of others, and helped me communicate science better. Adam's deep technical knowledge, experience, and determination were invaluable lessons to me and inspired me in my work. Thanks also to Dr Michael Inouye. Mike's wide-ranging knowledge, drive, and generosity have helped shape both my work and my approach to science in general. Thanks to Dr Izhak Haviv, whose passion for science is obvious to all who know him, and who was always willing to help and provide advice, at any time of the day (or night). Thanks to the head of my PhD committee, Professor Peter Stuckey, for patiently guiding me in my PhD process towards better research and a better thesis.

Thanks also go to my fellow students: Raj Gaire, Fan Shi, Gerard Wong, Ben Goudey, Shanika Kuruppu, Geoff Macintyre, and Justin Bedo. They have provided me with outlets from the sometimes isolating PhD process, and made the process much more enjoyable.

Thanks to Matthias Reumann from IBM Research and David Bannon from the Victorian Life Sciences Computing Initiative (VLSCI) for supporting my work, and thanks to VLSCI (project VR0126) and the Victorian Partnership for Advanced Computing (VPAC) for providing high-performance computing facilities. Thanks to NICTA and the University of Melbourne for scholarship funding and travel grants. Thanks to David A. van Heel (Queen Mary, University of London) for generously supplying the celiac disease data used in this thesis.

Finally, thanks are due to my family, and especially to my wife Laura, and my children Ori and Abigail, for their endless patience and love that have allowed me to complete this thesis.

Funding Statement

This work was supported by the Australian Research Council, and by the NICTA Victorian Research Laboratory. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications, and the Digital Economy, and the Australian Research Council through the ICT Centre of Excellence program. This work was made possible through Victorian State Government Operational Infrastructure Support and Australian Government NHMRC IRIIS. Michael Inouye was supported by an NHMRC Biomedical Australian Training Fellowship (no. 637400). This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under awards 076113 and 085475. Funding support for the GAIN Search for Susceptibility Genes for Diabetic Nephropathy in Type 1 Diabetes (GoKinD study participants) study was provided by the Juvenile Diabetes Research Foundation (JDRF) and the Centers for Disease Control (CDC) (PL 105-33, 106-554, and 107-360, administered by the National Institute of Diabetes and Digestive and Kidney Diseases, NIDDK), and the genotyping of samples was provided through the Genetic Association Information Network (GAIN). The dataset(s) used for the analyses described in this manuscript were obtained from the database of Genotypes and Phenotypes (dbGaP) found at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000018.v1.p1. Samples and associated phenotype data for the Search for Susceptibility Genes for Diabetic Nephropathy in Type 1 Diabetes study were provided by James H. Warram, MD, Joslin Diabetes Center.


Contents

List of Abbreviations

1. Introduction

2. Biological Background
2.1. The Central Dogma of Molecular Biology
2.2. The Molecular Basis for Disease
2.3. Gene Expression
2.3.1. Measuring Gene Expression
2.3.2. Challenges in Analysis of Gene Expression Microarrays
2.4. The Genetic Basis of Disease
2.4.1. Linkage Disequilibrium
2.4.2. Hardy-Weinberg Equilibrium
2.4.3. SNP Microarray Technology
2.4.4. Imputation
2.4.5. Genome Wide Association Studies
2.4.6. Challenges in Analysis of SNP Microarrays
2.4.7. The Problem of Missing Heritability
2.4.8. Expression Quantitative-Trait Loci
2.5. Summary

3. Review of the Analysis of Gene Expression and Genetic Data
3.1. Introduction
3.2. Supervised Machine Learning
3.3. Linear Models and Loss Functions
3.4. Feature Selection — Finding Predictive & Causal Markers
3.4.1. Filter Methods
3.4.2. Wrapper Methods
3.4.3. Embedded Methods
3.4.4. Other Methods for Dimensionality Reduction
3.4.5. Other Methods for Classification and Regression


5.6. Software Features
5.7. Discussion

6. Sparse Linear Models Explain Phenotypic Variation and Predict Risk of Complex Disease
6.1. Introduction
6.2. Methods
6.2.1. Genetic Models
6.2.2. HAPGEN2 simulations
6.2.3. Positive and negative predictive values
6.2.4. Genomic Inflation Factor
6.2.5. Data and quality control
6.3. Results
6.3.1. Recovery of Causal SNPs in Simulation
6.3.2. Modelling genome-wide profiles for eight complex diseases
6.3.3. Assessment of confounding factors
6.3.4. Discrimination of the phenotype in cross-validation
6.3.5. Genetic models in a population context
6.3.6. Genetic substructure of celiac disease and type 1 diabetes
6.4. Discussion
6.5. Conclusions

7. Genetic Control of the Human Metabolic Gene Regulation
7.1. Introduction
7.2. Methods
7.2.1. Data
7.2.2. Predictive Modelling
7.2.3. Causal Network Inference
7.3. Results
7.3.1. Predictive Models of Metabolites using Gene Expression
7.3.2. Integrating the Metabolite Models with Models of Gene Expression based on SNPs
7.3.3. Linking the Causal Networks to Fasting Glucose Levels and Type 2 Diabetes
7.4. Discussion

8. Fused Multitask Penalised Regression
8.1. Introduction
8.2. Background


8.3. Methods
8.3.1. Fused Multitask Penalised Regression
8.3.2. Implementation
8.3.3. Computational Enhancements
8.4. Simulation
8.5. Results
8.5.1. Simulation
8.5.2. Experiments on the DILGOM Dataset
8.5.3. Time Complexity
8.6. Discussion

9. Conclusions

A. Supplementary Results for Gene Set Statistics
A.1. Classifiers
A.2. Internal Validation
A.3. External Validation

B. Supplementary Results for Sparse Linear Models
B.1. Scoring Measures for Causal SNP Detection
B.2. Real Data
B.2.1. Checking for Stratification
B.2.2. AUC for Stringent Filtering
B.2.3. PPV/NPV
B.2.4. Comparison with Other Methods
B.2.5. Principal Component Analysis of Cases
B.3. Results for each dataset
B.3.1. Bipolar Disorder (BD)
B.3.2. Coronary Artery Disease (CAD)
B.3.3. Celiac Disease (Celiac)
B.3.4. Crohn's Disease/Inflammatory Bowel Disease (Crohn's)
B.3.5. Hypertension (HT)
B.3.6. Rheumatoid Arthritis (RA)
B.3.7. Type 1 Diabetes (WTCCC-T1D)
B.3.8. Type 2 Diabetes (T2D)

C. Supplementary Results for FMPR

Bibliography


List of Figures

2.1. The Central Dogma of molecular biology.
2.2. An outline of the gene expression microarray experiment, for spotted cDNA (left) and oligonucleotide arrays (right). Reprinted by permission from Macmillan Publishers Ltd (Staal et al., 2003), copyright (2003).
2.3. A revised Central Dogma of molecular biology. We distinguish between a clinical phenotype, which is a high-level phenotype such as case/control status, and other phenotypes — many of the other nodes can be considered phenotypes in their own right, such as gene expression (mRNA) in eQTL studies.
2.4. Different phenotypes are characterised by different combinations of variant frequency and effect size. Reprinted by permission from Macmillan Publishers Ltd (Manolio et al., 2009), copyright (2009).
2.5. Association for SNPs in chromosome 13 (q22.1) with pancreatic cancer. The diamonds and squares represent the log10 p-values for association of SNPs. Overlaid are the recombination rates (centimorgan per megabase). On the bottom is a heatmap showing the LD between the SNPs, measured by r². Reprinted by permission from Macmillan Publishers Ltd (Petersen et al., 2010), copyright (2010).
2.6. Measured intensities for two alleles of one locus, over all samples in the 1958 birth cohort of the WTCCC data (The Wellcome Trust Case Control Consortium, 2007). The samples coloured red, green, and blue are called as BB, AB, and AA, respectively. The light blue colour represents missing calls (CHIAMO genotype calls made with posterior probability < 0.9). The left and right panels show the calls before and after imputing the missing calls, respectively. Reprinted by permission from Macmillan Publishers Ltd (Marchini et al., 2007), copyright (2007).
2.7. The family-wise Type 1 error rate for k independent tests. The per-test threshold is α = 0.05.


3.1. An illustration of the relationship between true and empirical risk as model complexity increases. The Bayes risk is shown as constant since it assumes a fixed model complexity, which is the "correct" (but unknown) model. Empirical risk is the risk observed for a given model in a given finite dataset. On the far left-hand side, the model can be said to be underfitting, as the empirical risk is higher than the true risk. On the right-hand side, the model is overfitting, as it has lower empirical risk than the true risk.
3.2. Four loss functions for classification: 0/1 loss L(z_i) = I(z_i < 1), logistic regression L(z_i) = log(1 + exp(−z_i)), hinge loss L(z_i) = max{0, 1 − z_i}, and squared hinge loss L(z_i) = max{0, 1 − z_i}², where z_i = y_i(β_0 + x_i^T β) for linear models. Informally, for z ≥ 1 the predicted and observed classes match, sign(ŷ_i) = sign(y_i) (correct classification), and for z < 1 they do not match (mis-classification).
3.3. A toy example of a three-gene network. If gene A is mutated and causes a downstream effect in genes B and C, then all three genes may appear to be associated with the phenotype, even though gene B is clearly non-causal.
3.4. A contingency table of two alleles versus the case-control status, in terms of (a) counts, (b) conditional probabilities Pr(y|x), and (c) the odds.
3.5. Penalised squared loss in 2 dimensions. The red contours show the curves of constant loss for different solution pairs (β_1, β_2), and β* is the unpenalised solution. Also shown are the feasible regions (in cyan) imposed by the ridge constraint β_1² + β_2² ≤ t (left), and by the lasso constraint |β_1| + |β_2| ≤ t (right). Adapted from Hastie et al. (2009a).
4.1. A heatmap showing differentially expressed genes (rows) over a subset of 250 samples (columns) from the five breast cancer datasets. Differential expression was determined using linear models in limma (Smyth, 2005). Samples are coloured red and blue for < 5 years and ≥ 5 years to metastasis, respectively. Under- and over-expressed genes are coloured red and green, respectively.
4.2. Schematic of how the gene set features are constructed from three gene sets S_1 (red), S_2 (green), and S_3 (blue), each with 2, 3, and 1 gene/s, respectively. Note that for clarity we show non-overlapping sets, although the sets can overlap in practice.
4.3. Average and 95% confidence intervals for AUC from external validation between the five datasets, n = 2 × (5 choose 2) = 20 (train, test) pairs, for different numbers of features. We show only every second confidence interval for clarity. Note that each dataset ranks its features independently; hence, the kth feature is not necessarily the same across datasets. Individual genes are denoted raw.


4.4. Variance and 95% confidence intervals of the AUC from external validation between the five datasets, n = 2 × (5 choose 2) = 20 (train, test) pairs, for different numbers of features. The confidence intervals are [(n − 1)s²/χ²_{α/2,n−1}, (n − 1)s²/χ²_{1−α/2,n−1}], where χ²_α is the α = 0.05 quantile for a chi-squared distribution with n − 1 degrees of freedom, and s² is the sample variance.
4.5. Mean and 2.5%/97.5% of the ranks of genes and gene sets. Ranks are based on the weight assigned by the centroid classifier to each feature. For gene sets, we used the set centroid statistic. The process was repeated over 5000 bootstrap replications of the GSE4922 dataset. Features have been sorted by their mean rank.
4.6. Spearman rank-correlation of the centroid classifier's weights from the five datasets (n = 10 comparisons). Individual genes are denoted raw.
4.7. Concordance of feature lists (genes or gene sets) for different cutoffs f = 1, . . . , 200, counting the number of features occurring in all of the five datasets' lists, ranked higher than f. We use raw to denote individual genes. Prior to ranking, we selected 4120 genes (for the raw lists) or gene sets (for the set statistics) to be ranked, so that the number of unique items was identical across all lists.
4.8. Kolmogorov-Smirnov enrichment for MSigDB categories, using the set-centroid statistic. (A) AUC and spline smooth for each set, tested on GSE11121. (B) Number of mapped probesets in each set, on log2 scale, and spline smooth. (C) Two-sample Kolmogorov-Smirnov Brownian-bridge for each MSigDB category (p-values: C1: 1.44×10⁻⁴, C2: 3.55×10⁻¹⁵, C3: < 2.22×10⁻¹⁶, C4: 4.22×10⁻¹³, C5: 2.38×10⁻²).
4.9. AUC and weight versus set size for the set centroid statistic, using the centroid classifier.
4.10. Expression of ESR1 (ER) versus ERBB2 (HER2) for the combined dataset. A mixture of three Gaussians is fitted to the data. Clusters 1, 2, and 3 represent the ER−/HER2−, ER+/HER2−, and HER2+ subtypes, respectively.
5.1. Time (in seconds) for model fitting, over sub-samples of the entire celiac disease dataset, taken as the minimum over 10 independent runs. (a) For all methods including hyperlasso. (b) Excluding hyperlasso. For in-memory methods we included the time to read the binary data into R. For SparSNP and glmnet we used a λ grid of size 20, and a maximum model size of 2048 SNPs. liblinear used C = 1. hyperlasso used one iteration with λ = 1 (DE prior). The left panel includes all four methods, the right panel excludes hyperlasso. The insets show the leftmost panel (50,000 SNPs) on its own scale to better visualise the differences.


5.2. Left: LOESS-smoothed AUC and explained phenotypic variance (denoted "VarExp") for the Finnish celiac disease dataset, for increasing model sizes. For liblinear-cdblock (LL-CD-L2), all 516,504 SNPs are included in the model. AUC is estimated over 30 × 3-fold cross-validation. The explained phenotypic variance is estimated from the AUC using the method of Wray et al. (2010), assuming a population prevalence of K = 1%.
5.3. An example pipeline for analysing a SNP discovery dataset with SparSNP and testing the model on a validation dataset. Most of the data preparation and processing can be done with PLINK.
6.1. APRC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data.
6.2. (a) Area under the receiver operating characteristic curve (AUC) for models of the 9 case/control datasets. Results are LOESS-smoothed over 20 × 3-fold cross-validation. See the Supplementary Results for details on each disease. (b) LOESS-smoothed proportion of phenotypic variance explained for the lasso models for the 9 discovery datasets, using the method of Wray et al. (2010).
6.3. Lasso models can achieve high positive predictive values. PPV versus NPV for the lasso models of the 9 discovery datasets. Results are averaged over 20 × 3CV. See the Supplementary Results for the number of SNPs with non-zero coefficients in each dataset. Note that the curves do not span the entire range of NPV since not all sensitivity and specificity values can be observed in a finite dataset.
6.4. Genetic subclasses of celiac disease cases exhibit high predictability. The PCs are obtained from PCA of the genotypes belonging to the SNPs identified by the lasso models with ∼100 SNPs in cross-validation, for (a) the original Celiac1 dataset and (b) a stringently-filtered version of the Celiac1 dataset. Samples with a median specificity ≥ 0.99 in prediction of cases are highlighted in red.
7.1. Schematic diagram of our analysis pipeline.
7.2. Decision tree for inferring the causal graph structure based on the pattern of marginal and partial correlations, assuming that cis-QTLs are causal to the gene.


7.3. R² for regressing the metabolites on all gene probes, together with all clinical variables (model 1), or after removing the effect of the clinical variables (model 2), showing the top 10 for each model. The results for all metabolites are shown in the insets. Metabolites were sorted in descending order of R². R² was estimated with nested cross-validation. Note the different scales.
7.4. The top 10 variables (metabolites+genes for model 1 and genes for model 2) selected as predictors of metabolite variation in models 1 and 2. The genes were ranked by the proportion of metabolites for which each gene was selected by the lasso regression (a, b) or the proportion times the R² for the corresponding metabolite (c, d), in order to upweight genes that are not only included as predictors of many metabolites but are also more highly predictive. Each inset shows all the variables for each model. Note the different scales.
7.5. Ratio of R² in model 2 to R² in model 1 of metabolites, sorted in decreasing order. Large figure: metabolites with ratios ≥ 0.5. Inset: all 98 metabolites with positive R² in model 1.
7.6. Hierarchical clustering of the metabolites, using complete linkage.
7.7. Box-and-whisker plots of R² in model 2 for each metabolite cluster, predicting metabolite concentrations from gene expression.
7.8. Box-and-whisker plots of cross-validated R² for the stable genes associated with each metabolite (predicted from the SNPs), compared with an aggregation of all metabolites ("All") and a random set of genes ("Random"). Also shown is the number of stable genes associated with each metabolite.
7.9. Inferred network of regulation for serum triglycerides. Inferred causal edges are shown as solid edges. Dashed edges represent trans-QTLs, where a direct causal effect on the gene cannot be inferred. The edge widths are proportional to the R² of the marginal association between the nodes from a univariable linear regression (in parentheses).
7.10. Metabolites selected as stably associated with fasting glucose, as selected by lasso regression, correcting for the effect of the clinical variables. The edge weights show the exponentiated weights exp(β), corresponding to increases in (a) fasting glucose and (b) odds ratio of fasting glucose ≥ 7 mmol/L, respectively, for a one standard deviation increase in each metabolite, averaged over the cross-validation replications.
7.11. Inferred causal networks for three metabolites stably associated with fasting glucose levels. The edge weights are the R² from a univariable linear regression of each child node on each parent node. The R² from a multivariable lasso linear regression on all inputs (SNPs for genes and genes for metabolites) is shown in parentheses next to each node.


7.12. Top 5 principal components from PCA of the genotype data.
8.1. An illustration of a hypothetical setup in which five genes G_1, ..., G_5 are associated with five metabolites M_1, ..., M_5. Several metabolites share the same gene associations (solid lines), and therefore are correlated with each other (correlation shown by dashed lines). By leveraging the inter-metabolite correlations, multi-task methods such as fmpr and GFlasso aim to better identify which inputs (genes in this case) are truly associated with which outputs (metabolites), while avoiding spurious associations due to effects such as noise, under the assumption that correlated outputs are caused by common regulators (pleiotropic genes).
8.2. The solution path of fmpr for one parameter β_j over K = 10 tasks, for increasing γ and with λ = 0.
8.3. An illustration of the three sparsity setups used in the multi-task simulations. Top row: absolute values of the p × K weight matrix B used for generating the outputs y, for models 1, 2, and 3, respectively (model 4 has identical weights and correlations in absolute value to model 1). Bottom row: the K × K correlation matrices of the outputs y.
8.4. Simulations with varying number of samples N (Setup 1), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.5. Simulations with varying levels of noise σ (Setup 2), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.6. Simulations with varying number of tasks K (Setup 3), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.7. Simulations with varying weights β (Setup 4), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.8. Simulations with varying number of parameters p (Setup 5), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.9. Simulations with same sparsity but different weights β (Setup 6), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.10. Simulations with unrelated tasks (Setup 7), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.11. Simulations with a mixture of positively and negatively correlated tasks (roughly 50%/50% each, Setup 8), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.


8.12. The true non-zero simulation weights β, and the weights estimated by each method in one replication of the reference setup. The intensity of the lines represents the absolute value of the estimated weight β̂. The vertical and horizontal axes correspond to variables j = 1, ..., p and tasks k = 1, ..., K, respectively. Note that the weights of the lasso model were all zero.
8.13. Pearson correlations for 35 metabolites from cluster 1 of the DILGOM metabolites.
8.14. Recovered weight matrices B̂ for 200 genes over the 35 metabolites for lasso and fmpr-w2, based on penalties optimised by cross-validation. The vertical and horizontal axes represent genes and tasks, respectively. The intensity of each point represents the absolute value of the weight β.
8.15. Box-and-whisker plots of R² for fmpr-w2 and lasso over 35 metabolites from cluster 1 of the DILGOM metabolites, using gene expression as inputs. We used 10×5 nested cross-validation to produce 50 estimates. The stars represent statistical significance from a Bonferroni-corrected Wilcoxon rank-sum test, p ≤ 0.05/35 = 0.00143.
8.16. Average time to run fmpr over 50 independent replications. (a) p = 400, K = 10. (b) N = 100, K = 10. (c) N = 100, p = 100. The left panel in each subplot shows the wall time; the right panel shows time scaled to the same approximate range in order to better show the trends.
A.1. Internal validation (mean and 95% CI for AUC) for centroid classifier with RFE.
A.2. Internal validation (mean and 95% CI for AUC) for SVM classifier, using all features.
A.3. Internal validation (mean and 95% CI for AUC) for PAM classifier.
A.4. Internal validation (mean and 95% CI for AUC) for VV1 classifier.
A.5. Internal validation (mean and 95% CI for AUC) for VV2 classifier.
A.6. External validation (mean and 95% CI for AUC) for all models.
A.7. Kolmogorov-Smirnov plots for overlap between the gene sets and the modules of Desmedt et al. (2008).
A.8. Heatmap of selected genes in the combined dataset (932 samples), showing the three subclasses ER−/HER2−, ER+/HER2−, and HER2+.


B.1. APRC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data.
B.2. AUC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data.
B.3. The first 5 principal components (PCs) of (a) original Celiac1 data and (b) after removing high LD regions, thinning, and regression of previous SNPs. The strong structure in the top PCs is largely removed by accounting for LD. PCs 6–10 were only weakly predictive of the phenotype and are not shown for clarity.
B.4. PCA loadings per chromosome for each of the top 10 PCs (a) original Celiac1 data (b) pruned Celiac1 data. Note the different scales on the y-axis.
B.5. 10-fold cross-validated AUC for prediction of case/control status from the top 10 principal components of the Celiac1 dataset, using lasso logistic regression with glmnet (Friedman et al., 2010), selecting increasing numbers of principal components (right to left) (a) original dataset and (b) after LD-pruning.
B.6. LOESS-smoothed (with 95% pointwise confidence intervals about the mean) AUC for lasso models of stringently-filtered (a) Celiac1 and Celiac2-UK and (b) WTCCC-T1D, both in 30 × 3-fold cross-validation.
B.7. LOESS-smoothed AUC for models in 20×3-fold cross-validation.
B.8. Averaged PPV/NPV for models in 20×3-fold cross-validation.
B.9. Summary plots of one fold of cross-validation prediction in the WTCCC-T1D data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.
B.10. Summary plots of one fold of cross-validation prediction in the Celiac1 data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.


B.11. Summary plots of one fold of cross-validation prediction in the Celiac1 data after stringent filtering. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.
B.12. Summary plots of one fold of cross-validation prediction in the Celiac2-UK data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.
B.13. Summary plots of one fold of cross-validation prediction in the Celiac2-UK data after stringent filtering. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.
B.14. LOESS-smoothed proportion of explained phenotypic variance, over 20×3-fold cross-validation.
B.15. LOESS-smoothed AUC for lasso squared-hinge loss classifier and logistic regression for random subsamples of the T1D data. For each prespecified size N ∈ {50, 100, 200, 400, 800, 1600, 3200}, we randomly sampled the original 4901 samples (without replacement) to form a smaller dataset. The subsampling was repeated 30 times for N = 50, 20 times for N = 100, and 10 times for the rest. Within each subsampled dataset, we ran 10 × 3CV to evaluate the AUC (for example, 30 × 10 × 3CV for N = 50). For N = 4901, we used the original dataset without sampling, running 20 × 3CV.
B.16. Principal Component Analysis (PCA) of the cases only, using the top 100 SNPs identified by the lasso for the Celiac1 and Celiac2-UK datasets, and their stringently-filtered versions. Samples are colored by median specificity in the cross-validation replications: median specificity ≥ 0.99 (red), and the rest (black).
B.17. Principal Component Analysis (PCA) of the cases only, using the top 100 SNPs identified by the lasso for T1D. Samples are colored by median specificity in the cross-validation replications: median specificity ≥ 0.99 (red), and the rest (black).
B.18. AUC, PPV/NPV, and explained phenotypic variance for Bipolar Disease.
B.19. AUC, PPV/NPV, and explained phenotypic variance for Coronary Artery Disease.
B.20. AUC, PPV/NPV, and explained phenotypic variance for Celiac1.
B.21. AUC, PPV/NPV, and explained phenotypic variance for Celiac2-UK.
B.22. AUC, PPV/NPV, and explained phenotypic variance for Crohn's.
B.23. AUC, PPV/NPV, and explained phenotypic variance for Hypertension.
B.24. AUC, PPV/NPV, and explained phenotypic variance for Rheumatoid Arthritis.
B.25. AUC, PPV/NPV, and explained phenotypic variance for Type 1 Diabetes.
B.26. AUC, PPV/NPV, and explained phenotypic variance for Type 2 Diabetes.


C.1. Time to run fmpr over 50 independent replications. (a) p = 400, K = 10. (b) N = 100, K = 10. (c) N = 100, p = 100.


List of Tables

4.1. Clinical and demographic characteristics of the patients in the five breast cancer datasets. Samples were removed if they were censored before the 5-year cutoff or were treated with adjuvant therapy. The clinical summaries are for the cleaned version of the data (post-removal). Grade: histologic grade. Therapy: neoadjuvant or adjuvant therapy. 1Q: first quartile; Med: median; 3Q: third quartile. ER status: estrogen receptor status.
4.2. The gene set statistics used in this work.
4.3. Top 10 gene sets by average rank over the five datasets, using the set centroid statistic. GO enrichment p-values are from a Bonferroni-adjusted one-sided Fisher's exact test (30,330 tests). Sign=−1 if expression is negatively associated with long-term survival, and is +1 otherwise. The background list for the test includes all Affymetrix HG-U133A probesets that could be mapped to GO BP terms, excluding IEA annotations.
4.4. Breakdown of samples for each cancer subtype.
4.5. Top 10 MSigDB sets for ER/HER2 molecular subtypes, chosen by the centroid classifier using the set centroid statistic. Sign=−1 if expression is negatively associated with long-term survival, and +1 for positive association with long-term survival.
4.6. Top 10 sets using the set centroid statistic using different classifiers, and the p-value for the size of the intersection between the top individual genes and the top gene sets (Fisher's exact test, one-sided). CC is centroid classifier, LR is logistic regression.
6.1. List of discovery datasets used in this analysis. The 1958 British Birth Cohort (N = 1480) and the National Blood Service (N = 1458) datasets were used as shared controls for all WTCCC datasets. † Celiac1 used Illumina HumanHap33v1-1 for cases and HumanHap550-2v3 for controls, and Celiac2-UK used Illumina 670-QuadCustom-v1 for cases and Illumina 1.2M-DuoCustom-v1 for controls.


6.2. List of independent replication datasets used. The National Blood Service† (N = 1458) dataset was used as controls for the GoKinD-T1D dataset. Celiac2-IT and Celiac2-NL used Illumina 670QuadCustom-v1 for cases and controls, Celiac2-Finn used 670-QuadCustom-v1 for cases and 610-Quad for controls.
6.3. AUC and explained phenotypic variance for independent validation datasets of celiac disease models trained on Celiac1. We used models with ∼200 SNPs in the model, trained in cross-validation on Celiac1 and tested on subsets of the Celiac2 dataset. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 1%.
6.4. AUC and explained phenotypic variance for independent validation datasets of celiac disease models trained on Celiac2-UK. Models were trained in cross-validation on the UK subset of the Celiac2 datasets, and tested on the other three subsets of the Celiac2 dataset. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 1%.
6.5. Models were trained in cross-validation on the WTCCC-T1D dataset and tested on the GoKinD-T1D dataset, using ∼100 SNPs in the model. The 95% confidence interval is derived from the LOESS fit. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 0.54%.
7.1. The marginal and conditional independence statements that can be derived from the (SNP, Gene, Metabolite) graph, and the corresponding correlation and partial correlations.
7.2. The stable predictive genes selected for each metabolic cluster (appeared in the lasso model in ≥ 60% of the cross-validation replications). "-" indicates that no genes were stably selected in this cluster.
7.3. trans-QTLs for genes associated with the metabolites predictive of fasting glucose levels.
7.4. Genomic inflation factors for genes associated with metabolites predictive of fasting glucose, based on the median χ² statistics from the linear model of association in PLINK.
A.1. […] of external-validation AUC for different numbers of features. The AUC for individual genes is used as the intercept.


B.1. The confusion matrix of predicted versus actual classes. "True" is truly causal SNPs, "False" is non-causal SNPs, Ŷ = 1 and Ŷ = 0 are predictions of causal and non-causal SNPs, respectively.
B.2. Genomic inflation factors λ estimated by PLINK v1.07 using the median of statistics for either the 1-df χ² test (--assoc --adjust) or the logistic regression test (without covariates, --logistic --adjust).
B.3. Population prevalence for each disease as used in this work.
B.4. AUC and proportion of phenotypic variance explained for GCTA (Yang et al., 2011), using 3-fold cross-validation (CV). AUC was derived from the per-sample scores in the test folds for each cross-validation fold. The 95% confidence interval is from a one-sample t-test, and explained variance (including the confidence intervals) is estimated from the AUC and prevalence K using the method of Wray et al. (2010). The column denoted N is the number of AUC values estimated in cross-validation — each 3CV produces N = 3 AUC values.
B.5. AUC estimated in 10 × 10-fold cross-validation on chr6 in the Celiac1 dataset (2200 samples, 19,169 SNPs).
B.6. BD dataset, autosomes only. Prevalence from Bebbington and Ramana (1995); Wray et al. (2010).
B.7. CAD dataset, autosomes only. Prevalence from Wray et al. (2010).
B.8. Celiac datasets, autosomes only. Prevalence from van Heel and West (2006).
B.9. Crohn's dataset, autosomes only. Prevalence from Carter et al. (2004); Wray et al. (2010).
B.10. HT dataset, autosomes only. Prevalence from NHS (2010).
B.11. RA dataset, autosomes only. Prevalence from Wray et al. (2010).
B.12. T1D dataset, autosomes only. Prevalence from Wray et al. (2010).
B.13. T2D dataset, autosomes only. Prevalence from Wray et al. (2010).


List of Abbreviations

AUC – area under receiver operating characteristic curve
FN – false negative
FP – false positive
GEO – Gene Expression Omnibus
GiB – gibibyte, 2³⁰ bytes
GO – Gene Ontology
GWAS – genome-wide association study
LD – linkage disequilibrium
Mb – megabase
MiB – mebibyte, 2²⁰ bytes
mmol/L – millimoles per litre
MSE – mean squared error
NPV – negative predictive value
PCA – principal component analysis
PPV – positive predictive value
PRC – precision-recall
ROC – receiver operating characteristic
SNP – single nucleotide polymorphism
SVM – support vector machine
TiB – tebibyte, 2⁴⁰ bytes
TN – true negative
TP – true positive
WTCCC – Wellcome Trust Case-Control Consortium
eQTL – expression quantitative trait locus


1. Introduction

The development of high throughput technologies for assaying the molecular characteristics of tissues and cells has been transforming the biological sciences for the past decade or so, making them increasingly quantitative. Modern technologies such as gene expression microarrays, single nucleotide polymorphism (SNP) microarrays, epigenetic marker arrays, whole genome sequencing, and high throughput metabolomics all generate a wealth of data. These datasets have been immensely useful for characterising genetic, transcriptomic (gene expression), and metabolomic variation across individuals and populations, and for relating this variation to observed phenotypes such as disease. For example, gene expression datasets allow us to assess how predictive gene expression is of phenotypes such as breast cancer metastasis (van 't Veer et al., 2002), and which genes and pathways are responsible for these cellular processes. With SNP datasets, we can assess which SNPs are strongly associated with disease, which genes are likely affected by these SNPs, and more generally evaluate the genetic architecture of each phenotype (Manolio et al., 2009) and the strength of the genetic component in the overall observed variation of the phenotype.

Once the non-trivial technical challenge of obtaining the data has been overcome, the next challenge is extracting useful and relevant biological information from the data. Due to the large size and complexity of such datasets, manually detecting and interpreting patterns in them is beyond the ability of any human, and data analysis is increasingly relying on computational and statistical methods to extract meaningful insight, either for diagnostic purposes or for generating hypotheses about the underlying biology that can later be verified in the lab. This thesis deals with the computational and statistical aspects of modelling large


molecular marker data across several domains, including gene expression data, SNP data, and metabolite data, focusing on efficient and effective methods for analysing the data while maintaining biological interpretability.

More broadly, the topics of this thesis are related to the goal of "personalised medicine", which aims to diagnose and treat patients based on their own genomic information, at a level far more specific and detailed than has previously been possible by relying solely on traditional clinical variables such as age, sex, and disease symptoms. One such example is the use of genomic profiles to identify cancer subtypes and consequently to prescribe different drugs to cancer patients based on their subtype (Chin et al., 2011; Schilsky, 2010). Personalised medicine presents technical challenges for current methods in computer science and machine learning (Fernald et al., 2011): First, the large size of many of the datasets requires careful design of algorithms for processing and storing the data. Second, interpreting any patterns or associations in terms of function and effect on the phenotype is challenging. Third, there is the challenge of integrating diverse data types, each capturing a unique aspect of the data, into a coherent model of the underlying biological processes. Finally, there is the challenge of translating mathematical models into clinically relevant and actionable insights.

Main Themes of This Thesis

There are several key challenges in the computational and statistical analysis of molecular biology data, which motivate the ideas developed in this thesis.

Prediction We employ statistical models of molecular marker data trained on data where the phenotype is known (supervised learning). The main criterion we use to evaluate our statistical models is their predictive ability with respect to the phenotype being modelled: how well does a model predict the phenotype given some input such as gene expression or SNPs? Predictive ability quantifies the degree of association between the inputs (genotypes or gene expression) and the outputs (phenotypes). Competing models are compared by assessing their predictive ability. High predictive ability can also be useful in a practical sense, for example, for clinical diagnosis of which patients are at higher risk of breast cancer metastasis and relapse (discussed in Chapter 4). Predictive ability can be measured using different statistics: for classification we might employ accuracy, the area under the receiver operating characteristic curve, and precision-recall curves, whereas for regression we might use the mean squared error or R^2. Therefore, the choice of the appropriate predictive measure is an important part of the modelling process.

Interpretability Beyond predictive ability, having interpretable models is an important consideration, as one of the main goals of biology is to generate plausible mechanistic explanations of the cellular processes underlying health and disease. A model that has very high predictive


ability but is difficult to interpret may be less useful than a less predictive model that is amenable to biological interpretation and provides insights into disease etiology. Hence, interpretation is another theme of this thesis. All the models we employ in this thesis are linear models (or transformations of linear models, such as logistic regression), where the weights given to each input (marker) are directly interpretable as their contributions to the overall model. In particular, we use lasso models (Chapters 5, 6, and 7), which are sparse models, leading to the selection of a relatively small number of markers that enter the model with a non-zero weight. This is in contrast with other approaches, for example kernel methods, which might achieve good prediction but where interpretation in terms of the contributions of individual inputs is difficult.

Scalability A good modelling approach is of no practical use if it cannot be applied to real data. The modelling approaches should be computationally tractable and scalable, as datasets are rapidly increasing in size both in terms of samples and in terms of markers. For example, genome-wide association studies now routinely include tens of thousands of samples assayed over more than half a million single nucleotide polymorphism (SNP) markers, and array sizes of around two million markers are now becoming available. The algorithms we use to fit the models in this thesis are efficient and scalable to large datasets, allowing us to successfully model current SNP datasets involving thousands of samples and hundreds of thousands of markers, where other approaches may fall short or require much greater computational resources.

Data Integration Multiple types of data can be assayed over the same samples. For example, The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov) contains SNP, gene expression, copy number, methylation, micro-RNA, and other datasets assayed over the same samples, for several cancer types. While each dataset provides valuable information about the underlying biological mechanisms of disease, these are disparate views of the same cellular processes. Potentially greater insight is produced by integrating the data into a coherent biological model, where health and disease are considered to be outcomes of complex interactions between genetic variation, epigenetics, and environment. Chapter 7 is a case study, where we employ sparse linear models to perform an integrated analysis of the DILGOM multi-omic dataset (Inouye et al., 2010b), consisting of SNPs, gene expression, metabolites, and clinical variables, inferring causal networks of SNPs and genes affecting metabolite levels and associated with fasting glucose, a clinical marker for type 2 diabetes. The DILGOM dataset is currently one of the largest datasets of this type.
Another form of data integration is analysis of datasets where multiple correlated phenotypes are assayed over the same samples. Algorithms for analysis of correlated phenotypes are discussed in Chapter 8.
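To make the prediction and interpretability themes concrete, the sketch below fits an L1-penalised (lasso) logistic regression to simulated genotype-like data and evaluates it by AUC. It is a minimal illustration only, not the methods or software developed in later chapters; the simulated data, the scikit-learn calls, and the penalty setting (C = 0.1) are assumptions made purely for the example.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Simulated data: 1000 samples x 500 SNPs coded 0/1/2, with 10 truly associated SNPs.
    n, p = 1000, 500
    X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
    effects = np.zeros(p)
    effects[:10] = 0.5
    risk = X @ effects
    y = (risk + rng.normal(size=n) > np.median(risk)).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # The L1 penalty drives most coefficients to exactly zero (a sparse model).
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    model.fit(X_train, y_train)

    auc = roc_auc_score(y_test, model.decision_function(X_test))
    n_selected = int(np.sum(model.coef_ != 0))
    print(f"test AUC = {auc:.2f}; SNPs with non-zero weight = {n_selected}")

The markers with non-zero weights are the "selected" inputs; reading the model reduces to inspecting this short list, which is the sense in which sparse linear models remain interpretable while still being judged on predictive ability.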


Thesis Outline and Contributions

Chapter 2 — Biological Background is a review chapter, introducing the basic biological concepts used in this thesis, beginning with topics such as the central dogma of molecular biology, how gene expression is measured using microarrays, and challenges in the analysis and interpretation of gene expression experiments. We then discuss genetic data, particularly single nucleotide polymorphism (SNP) data. We highlight several competing hypotheses for the genetic basis of disease, and discuss genome-wide association studies (GWAS), which are unbiased scans of SNP data for association with phenotypes. Finally, we discuss the problem of missing heritability, and expression quantitative trait loci (eQTL), where gene expression itself is used as a phenotype in a GWAS.

Chapter 3 — Review of the Analysis of Gene Expression and Genetic Data discusses some of the main concepts in statistics and supervised machine learning that form the basis for the rest of the work in this thesis. We emphasise the roles of feature selection, and especially of sparse penalised methods such as the lasso, which are used heavily throughout the thesis for modelling data.

Chapter 4 — Prediction of Breast Cancer Prognosis using Gene Set Statistics discusses the problem of predicting future breast cancer metastasis and relapse based on gene expression data. This problem has been the focus of intense research, as successful prediction of which women are at higher risk would have important clinical implications for many thousands of women around the world, allowing more personalised treatment of the most common cancer in Western women. While gene expression has been found to be moderately predictive of metastasis risk, the prognostic genes found by different studies have so far been largely inconsistent, raising doubts about the interpretation of these results and the underlying biological mechanisms. We propose an approach based on gene sets, rather than individual genes, where membership of genes in a set is based on prior knowledge, such as the literature, large scale experiments, and curated pathways. The gene expression levels in each set are aggregated using a set statistic. The set expression levels are then used as inputs for standard classification models. We evaluate five breast cancer datasets, comparing the set approach with the standard approach based on individual genes. Our contributions include:

• We propose a gene set statistic framework, which produces gene signatures based on sets of genes rather than individual genes.

• We apply our method to five independent breast cancer datasets, evaluating multiple variants of the gene set method.

• We demonstrate that the gene set approach produces more robust and consistent prognostic signatures than those based on individual genes.


• We show that the top predictive sets are highly biologically interpretable, consisting of genes belonging to known pathways associated with the cell cycle and metastasis processes.

Chapter 5 — Fast and Memory-Efficient Sparse Linear Models deals with the problem of fitting sparse (lasso) linear statistical models to large genetic datasets. Such models can potentially be useful for diagnostic purposes by predicting disease from genotype, for identifying the SNPs associated with the phenotype, and for estimating how much of the variability in the phenotype is due to genetic factors. However, existing general-purpose approaches are not well suited to this problem, as the datasets are large and fitting the models requires large amounts of RAM. We propose an efficient algorithm, named SparSNP, which operates in an out-of-core fashion, allowing it to rapidly fit lasso classification and regression models to large SNP data while using low amounts of memory (a simplified sketch of this block-wise access pattern is given after the Chapter 6 contributions below). We compare our approach with several other state of the art methods for fitting lasso models, using a case/control celiac disease dataset. Our contributions include:

• We develop an out-of-core algorithm for efficiently fitting lasso models to large SNP data, either for classification or for regression.

• We evaluate our implementation, SparSNP, against several state of the art methods on real genetic data, demonstrating that our method is faster than existing approaches and more scalable in terms of memory requirements.

Chapter 6 — Sparse Linear Models Explain Phenotypic Variation and Predict Risk of Complex Disease applies the lasso models developed in Chapter 5 to the problem of case/control prediction in eight datasets of human complex disease, such as type 1 and 2 diabetes and celiac disease. Our contributions include:

• We compare lasso models with univariable methods for the task of detecting associated genetic variants, and analyse the strengths and weaknesses of each approach.

• We perform an analysis using lasso models of eight complex human diseases, including celiac disease, type 1 and 2 diabetes, Crohn's disease, bipolar disorder, hypertension, rheumatoid arthritis, and coronary artery disease. For each disease, we characterise how well it can be predicted from the genetic data and how much phenotypic variance can be explained by the data.

• We propose that celiac disease and type 1 diabetes may have genetic subtypes that exhibit more genetic predictability than others, and assess the importance of these findings in the population-wide setting.
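The out-of-core strategy behind Chapter 5 can be illustrated schematically: instead of loading the full genotype matrix into RAM, the data stay on disk and are visited one block of SNPs at a time, so memory use is bounded by the block size. The sketch below is a simplified, hypothetical illustration of that access pattern only; it is not SparSNP itself, and the file layout, dimensions, and per-block computation are assumptions made for the example.

    import numpy as np

    n_samples, n_snps, block_size = 2000, 50000, 5000

    # Create a toy on-disk genotype matrix (0/1/2, one byte per entry) purely for
    # illustration; in a real study such a file would already exist.
    rng = np.random.default_rng(0)
    genotypes = np.memmap("genotypes.dat", dtype=np.int8, mode="w+",
                          shape=(n_samples, n_snps))
    genotypes[:] = rng.binomial(2, 0.3, size=(n_samples, n_snps)).astype(np.int8)
    genotypes.flush()

    # Out-of-core pass: only one block of SNPs is ever expanded to floats in RAM.
    allele_freq = np.empty(n_snps)
    for start in range(0, n_snps, block_size):
        stop = min(start + block_size, n_snps)
        block = np.asarray(genotypes[:, start:stop], dtype=float)
        allele_freq[start:stop] = block.mean(axis=0) / 2.0

    print(allele_freq[:5].round(3))

The same block-wise pattern applies when the per-block computation is a model update rather than a simple summary statistic; the point is only that memory usage depends on the block size, not on the total number of SNPs.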


Chapter 7 — Characterising the Genetic Control of Human Metabolic Genes leverages the lasso models developed earlier in the setting of prediction of quantitative traits. We perform an integrated analysis of a multi-omic dataset (DILGOM) consisting of SNPs, gene expression, clinical variables, and serum metabolites, with the aim of deriving insights into the genetic control of metabolites mediated by genes. Our contributions include:

• We identify genes highly associated with metabolite levels, characterising the degree to which each metabolite can be predicted. Many of these genes are known to be associated with metabolism; however, we also identify genes with strong associations but previously unknown function.

• We identify and characterise SNPs that are likely to regulate some of these predictive genes, both in cis and in trans.

• We associate fasting glucose levels, a clinical marker for type 2 diabetes, with metabolites, and infer causal networks of genetic regulation for these metabolites, mediated by gene expression. These networks represent novel hypotheses that may explain some of the genetic basis of type 2 diabetes.

Chapter 8 — Fused Multitask Penalised Regression proposes a novel statistical framework for regression and classification in the multi-task (multiple phenotype) setting, termed Fused Multitask Penalised Regression (FMPR). Examples of such settings include prediction of multiple metabolite levels from gene expression and prediction of the expression of multiple genes from genetic variants. Our method leverages the correlations between outputs to produce sparse models, assuming that correlated outputs are due to shared inputs. In contrast, most existing methods, such as the lasso, ignore such relatedness and treat each phenotype separately (a small illustrative contrast between single-task and multi-task fits is sketched at the end of this outline). Our contributions include:

• A novel multitask model for modelling correlated outputs.

• An algorithm and efficient implementation of our multi-task method.

• A comparison of our method with existing single-task and multi-task methods in simulation, demonstrating the usefulness of our approach both in prediction accuracy and in recovering the true causal inputs.

• An evaluation of our method on real data involving gene expression and metabolites, demonstrating that our approach results in better predictive models than the lasso.

Chapter 9 — Conclusions concludes this thesis and discusses ways in which this work can be extended in the future.
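The contrast between single-task and multi-task fitting referred to above can be sketched as follows. The example uses scikit-learn's generic Lasso (one model per output) and MultiTaskLasso (joint row-wise selection across outputs) purely as stand-ins to illustrate the setting; it is not the FMPR method proposed in Chapter 8, and the data and penalty values are assumptions made for the illustration.

    import numpy as np
    from sklearn.linear_model import Lasso, MultiTaskLasso

    rng = np.random.default_rng(1)

    # 200 samples, 100 inputs (e.g. expression probes), 5 correlated outputs
    # (e.g. metabolite levels) all driven by the same 5 inputs.
    n, p, k = 200, 100, 5
    X = rng.normal(size=(n, p))
    B = np.zeros((p, k))
    B[:5, :] = rng.normal(size=(5, k))
    Y = X @ B + 0.5 * rng.normal(size=(n, k))

    # Single-task baseline: an independent lasso per output, ignoring shared structure.
    single = np.column_stack([Lasso(alpha=0.1).fit(X, Y[:, j]).coef_ for j in range(k)])

    # Multi-task penalty: inputs are selected jointly (a row is zero for all outputs or for none).
    multi = MultiTaskLasso(alpha=0.1).fit(X, Y).coef_.T

    print("inputs selected per output (single-task):", (single != 0).sum(axis=0))
    print("inputs selected jointly (multi-task):", int((np.abs(multi).sum(axis=1) > 0).sum()))

When the outputs really do share their causal inputs, the jointly penalised fit tends to recover a single compact set of inputs, which is the intuition that multi-task approaches build on.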


Summary

In summary, this thesis is concerned with effective and efficient supervised learning methods for modelling molecular marker data, such as gene expression levels, metabolite levels, and SNPs. This work shows the feasibility of sparse models for analysis of large datasets, and the utility of these models for modelling human gene expression and SNP data, both for prediction of the phenotype and for biological interpretation of the underlying cellular mechanisms. We also demonstrate the increased biological insight gained from an integrated analysis of multi-omic datasets, combining multiple sources of data assayed on the same samples. Overall, this work advances the possibility of developing better predictive models of human disease, bringing the potential benefits of personalised medicine closer to realisation.


2. Biological Background

In this chapter we survey some of the basic biological concepts and terminology used in this thesis. Our discussion mainly uses human disease as the phenotype of interest, but many of the underlying mechanisms hold more generally across other phenotypes and other organisms. In addition, we limit our discussion to the genomic components of disease, and do not examine environmental factors or interactions of genes with the environment.

2.1. The Central Dogma of Molecular Biology

The central dogma of molecular biology is that information in the cell flows in one direction, starting with DNA, as illustrated in Figure 2.1. DNA (deoxyribonucleic acid) is composed of four bases: A (adenine), G (guanine), C (cytosine), and T (thymine). DNA is divided into two types of regions — genic and intergenic — where the former contain genes and the latter do not. The genic regions are composed of expressed regions (exons) and intervening regions (introns). The first step in the information flow, therefore, is when the gene is transcribed into precursor-mRNA, which is then spliced (removing the introns) to form messenger RNA (ribonucleic acid). Second, the ribosomes translate the transcribed mRNA sequence to protein sequence by chaining together amino acids, one for each codon (base triplet) in the mRNA. Third, the completed protein is released and is then free to perform some intra-cellular or extra-cellular action, for example as an enzyme or as a structural protein in the cell. Note that the entire process of expression begins with "step zero", when a protein called a transcription factor (TF) binds to the promoter region of a gene, initiating a complex chain of events in which the gene is transcribed and spliced


and mRNA is produced. The system is self regulating — DNA codes for transcription factors that bind to DNA and modulate the expression of other genes and their protein products, including other transcription factors, thus creating a closed loop, an essential component of self regulation (Alon, 2007).

Figure 2.1.: The Central Dogma of molecular biology (DNA → mRNA → protein → phenotype).

A mutation in the DNA, due, for example, to imperfect replication or to ionising radiation, may change the amino acid sequence of the protein produced, or may lead to under- or over-production of certain proteins. These molecular-level conditions may manifest through what we perceive as disease. A well-known example is the mutation in the gene HBB in humans, causing misfolding of the protein beta-globin, which in turn manifests as the disease known as sickle cell anaemia (http://www.ncbi.nlm.nih.gov/omim/603903).

Although this four-step model of cellular information flow is known to be a crude oversimplification of a complex reality, it is a useful mental model nonetheless, as long as we are mindful of its limitations. We expand this basic model throughout this chapter, incorporating other known mechanisms of information flow.

2.2. The Molecular Basis for Disease

To date, there have been two major efforts in the search for cellular phenomena associated with disease. The first, which we may term "transcriptomic", has been in the area of gene expression. Under the central dogma, over- or under-expressed genes are an important step in a complex molecular cascade eventually leading to what we perceive as disease. Therefore, in the transcriptomic approach, we search for associations of genes with some observed phenotype


such as the disease itself or some other clinical measurement (for example, tumour size or survival time), with the aims of implicating genes that might be responsible for the condition and potentially developing diagnostic and prognostic tools. Since gene expression manifests itself through mRNA in the cell, what we actually measure is mRNA levels, using gene expression microarrays.

The second effort, which we term "genetic", has been to characterise which DNA loci harbour variations that are strongly associated with the phenotype. Out of the possible DNA variations, we concentrate on single-nucleotide polymorphisms (SNPs). Other variations include copy number variations and chromosomal inversions.

SNPs are single-base DNA loci that have several variants, usually two (biallelic). SNPs can occur anywhere in the DNA: in intergenic regions, in exons, and in introns. The variation may or may not have an effect in the cell, depending on the functional importance of where it occurred and the nature of the variation itself — exonic variations that do not change the resulting protein are synonymous, whereas those that alter the protein are non-synonymous.

There are significant differences between the transcriptomic and genetic approaches. In contrast with a SNP, which is a physical variation in one DNA nucleotide, a gene is a conceptual annotation of a region of DNA, typically thousands of nucleotides long, that usually encodes a protein. Each gene has its own regulatory region, the promoter, where transcription factors can bind and regulate the gene's expression. Therefore, the difference between analysing gene expression and analysing SNPs is that with the former we are asking how active is the gene? whereas with the latter we are asking which variant of the DNA do we have? and, assuming that the SNP is known to affect the gene, which variant of the gene do we have? Another important difference between the transcriptomic and genetic approaches is that mRNA levels are merely snapshots of highly dynamic cellular activities and depend on factors such as which tissue was measured and when. Changes to mRNA levels occur on time-scales ranging from minutes to days or months (Alon, 2007), depending on which cellular mechanism they belong to. In contrast, (non-somatic) genetic mutations are believed to be largely immutable once they have occurred, are passed from one generation to the next unless they are highly lethal, and occur over time scales of multiple generations, which means decades for humans.
Currently, there are estimated to be about 30,000 human genes, whereas the HapMap (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2005, 2007) and 1000 Genomes (1000 Genomes Project Consortium, 2010) projects have to date mapped more than 30 million human SNPs, and the number is likely to rise as more diverse human populations are sequenced.

These differences between transcriptomic and genetic data, together with the fact that basic genetic theory dates back to Mendel in the 19th century, before the discovery of the structure of DNA and the understanding of basic genomic mechanisms in the 1950s, have meant that transcriptomic and genetic data have largely been analysed separately and using different methods, as


discussed in Chapter 3. Recently, more integrative approaches have been proposed, such as the analysis of expression quantitative trait loci (eQTL), that integrate transcriptomics and genetics in order to produce a clearer picture of cellular mechanisms; we discuss these in Section 2.4.8.

2.3. Gene Expression

Genes do not affect phenotype directly. Rather, their effects are mediated by mRNA and protein. Since DNA mutations are known to cause changes in observed phenotypes in organisms, it is reasonable to examine whether any genes exhibit changes in activity patterns that can be associated with phenotype. Gene activity, termed gene expression, is inferred through measuring the levels of mRNA in the cell. In the simplest conceptual model, genes are thought of as being "ON" (highly expressed) or "OFF" (low or basal expression). More sophisticated models may be more fine grained, considering genes on a continuous scale. Different genes show different basal levels of expression and different dynamic ranges. Therefore, gene expression levels are usually not compared directly, but are expressed as the difference in expression between two phenotypic states — differential expression. Genes with significant differential expression between two conditions (case/control) are candidates for the genes that carry the mutation or are otherwise potentially important in the underlying molecular mechanism leading to the phenotype. Statistical analysis of differential expression is discussed further in Chapter 3.

2.3.1. Measuring Gene Expression

Gene expression microarrays are the tools used to measure relative or absolute concentrations of mRNA in tissue, and thus to infer gene activity levels. The arrays themselves are made of thin pieces of glass or plastic, and on their surface there are thousands of spots with short DNA sequences. Early microarrays were usually of the two-channel spotted cDNA (complementary DNA) type, usually custom-made, whereas modern microarrays are typically commercial high density oligonucleotide arrays, such as those produced by Affymetrix (Santa Clara, CA, USA). Whereas spotted arrays tend to contain several thousand relatively long probes, modern microarrays contain tens of thousands of probesets; the human HG-U133plus2.0 by Affymetrix contains roughly 50,000 short probesets. A probeset is a set of 16–20 short DNA probes used to measure one transcript. Each probeset contains perfect match (PM) and mismatch (MM) probes, designed to enable calibration of expression, accounting for effects such as non-specific binding (see Section 2.3.2). With all arrays there is not necessarily a one-to-one correspondence between genes and probes, and post-processing is needed to map from probesets to genes.

As outlined in Figure 2.2, the basic stages of a microarray experiment are:


• mRNA extraction mRNA is retrieved from tissue samples, either from samples such as biopsies or blood, or from cell lines. Since gene expression levels differ between tissues, mRNA must be extracted from the relevant tissue for each experiment. Once mRNA has been extracted, a complementary copy of the mRNA is created from it (A→T, G→C, and vice versa). For spotted arrays, cRNA samples from two phenotypes (for example, tumour and normal) are treated slightly differently; samples from one phenotype are labelled with red fluorescent dye (Cy5), whereas those from the other phenotype are labelled with green (Cy3). cRNA for oligonucleotide arrays undergoes biotinylation (labelling with biotin).

• Hybridisation and washing The labelled cRNA is hybridised to the arrays, a process in which some of the cRNA binds to the probes on the array and the remaining unbound material is washed off. When the sample contains more cRNA, more of it binds to the probes, and vice versa. For oligonucleotide arrays, this stage includes staining with streptavidin-phycoerythrin, which binds to the biotin. For spotted arrays, both matched samples, red and green, are hybridised to the same array.

• Measurement The arrays are scanned using a laser scanner that measures the amount of fluorescence at each spot on the array. The fluorescence is proportional to the number of cRNA molecules bound to the spot. For Affymetrix arrays, the image is stored as a CEL file, which contains image intensities and various experimental parameters. In general, oligonucleotide arrays measure absolute hybridisation intensities, and at least two arrays, one for each sample, are needed for the purpose of estimating differences in gene expression (differential expression). In contrast, spotted cDNA arrays measure relative hybridisation, since at each spot both the Cy3- and Cy5-labelled cRNA binds, albeit at potentially different levels. The ratio of hybridisation intensities is then used to estimate differential expression.

• Preprocessing After raw intensities have been determined, a crucial step in the analysis of gene expression microarrays is preprocessing of the data. Preprocessing involves quality control, in which arrays with many faulty probes are discarded, and normalisation, which is statistical correction for potentially confounding but biologically uninteresting artefacts due to measurement variation within and between the arrays (Smyth and Speed, 2003) (see Section 2.3.2 for further discussion). A common preprocessing method for oligonucleotide arrays is the Robust Multichip Average (RMA) (Bolstad et al., 2003; Irizarry et al., 2003a,b).

• Analysis After preprocessing, gene expression data is converted to log2 intensities, since this reduces the dynamic range of the data and makes the noise (variability) approximately normally distributed. Now the data can be used for tasks such as searching


for genes differentially expressed between two conditions, clustering of genes and samples into groups to find novel disease subtypes, and classification of the phenotype based on the gene expression. See Chapter 3 for more discussion of methods for analysing gene expression data.

Figure 2.2.: An outline of the gene expression microarray experiment, for spotted cDNA (left) and oligonucleotide arrays (right). Reprinted by permission from Macmillan Publishers Ltd (Staal et al., 2003), copyright (2003).
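As a small concrete illustration of the last two stages, the sketch below log2-transforms a simulated intensity matrix and ranks probes by a per-probe two-sample t-test between cases and controls. The data, group sizes, and cut-offs are invented for the example and do not describe any particular dataset analysed in this thesis.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(2)

    # Simulated raw intensities: 1000 probes x 20 arrays (10 cases, 10 controls),
    # with the first 50 probes up-regulated in the cases.
    intensities = rng.lognormal(mean=6.0, sigma=1.0, size=(1000, 20))
    intensities[:50, :10] *= 2.0

    log_expr = np.log2(intensities)          # log2 reduces the dynamic range
    cases, controls = log_expr[:, :10], log_expr[:, 10:]

    # Per-probe differential expression: two-sample t-test, case versus control.
    t_stat, p_val = ttest_ind(cases, controls, axis=1)
    top = np.argsort(p_val)[:10]
    log_fc = cases.mean(axis=1) - controls.mean(axis=1)
    print("top probes:", top)
    print("their log2 fold changes:", log_fc[top].round(2))

In a real analysis the p-values would also need correction for multiple testing across thousands of probes; the snippet only shows where the log2 transformation and the case/control comparison sit in the workflow.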
2.3.2. Challenges in Analysis of Gene Expression Microarrays

There are several sources of error, variation, and confounding in gene expression experiments. These sources can largely be classified into two groups, extrinsic and intrinsic. Extrinsic factors are ones that are not biologically insightful and potentially weaken or confound statistical


analyses, and should be eliminated as far as possible prior to analysis. Some extrinsic factors affecting gene expression experiments are:

• Hybridisation noise and batch effects Hybridisation of cRNA to the arrays is a chemical process and as such it is stochastic and dependent on external factors such as temperature, sample age, and experimental conditions such as the concentrations of the reagents used in the process. The stochasticity implies that the measured amount of hybridisation varies between experiments, even when measured under very similar conditions. In addition, if different samples in a study were analysed under different conditions or in different labs, then there can be batch effects in the data, that is, systematic differences in expression levels attributable to external factors rather than to intrinsic differences in the expressed genes. Depending on their magnitude and direction, batch effects can mask gene effects or create spurious gene effects in the data (for example, in the extreme case where all cases are analysed in one lab and all controls in another). Hybridisation noise is typically mitigated by strict laboratory protocols and by averaging the data over a large number of samples. Batch effects can be prevented through careful experimental design, such as randomising the split of samples between labs, or by strict analysis of the samples in the same lab under similar conditions. If batch effects are already present in the data, statistical correction can be used to some degree; however, it is more difficult than preventing the problem in the first place, since the analysis requires making assumptions about the sources of variability and their relative importance, and these sources are not always well known. A less common solution is to use technical replicates — repeated arrays of the same tissue sample, sometimes analysed by different labs. Technical replicates can be averaged to form a more stable estimate of the hybridisation for that sample.

• Measurement noise The hybridised cRNA is stained or labelled with fluorescent dye, which is detected by the laser and converted to a digital image. This process is not perfect, and errors can be introduced by the measuring device. Again, this problem is usually mitigated through quality control and the use of multiple samples.

• Differential binding Different cRNA fragments have different binding affinities, as determined by the thermodynamics of the binding process, which are, in turn, dictated by their chemical structure and environmental factors such as temperature. Therefore, at equal concentrations, certain fragments are more likely to bind to the probes than others. This may result in a biased estimate of gene expression, since the weakly-binding fragments will be measured as missing or having low expression relative to the strongly-binding fragments.

• Non-specific binding Each probe on an oligonucleotide microarray is a short fragment of cDNA, 25 nucleotides long. A set of 16–20 probes forms a probeset. The probeset


is intended to measure binding for one gene. However, some cRNA fragments can bind to more than one probe, even across probesets, potentially being measured as multiple genes. This non-uniqueness in binding, also called cross-hybridisation, is a confounding factor in measuring gene expression. Oligonucleotide arrays such as those by Affymetrix try to mitigate this problem by employing two types of probes, perfect match (PM) probes and mismatch (MM) probes. The PM probe should be bound by the true gene's cRNA, whereas the MM probe differs in its central nucleotide from the PM probe and is intended to be bound by non-specific fragments. Hence, the simplest approach to account for non-specific binding is to normalise each PM probe by its matching MM probe. However, more sophisticated approaches have been developed, such as quantile normalisation (Bolstad et al., 2003), and MM probes are now ignored or have been completely removed from recent arrays.

• Faulty arrays Some arrays in an experiment may be faulty, for example, when many of the probes did not hybridise well. Quality control, and sometimes visual inspection of the results for each array, is necessary to make sure these arrays are discarded prior to analysis.

• Experimental errors By experimental error we refer to sources of variation and noise that are due to things such as mislabelling of samples (cases labelled as controls and vice versa), or measuring the same sample several times unintentionally. Some of these errors can be detected during the analysis, in the quality control stage (for example, detecting duplicated samples); however, others, such as mislabelled phenotypes, may be harder to detect and should be avoided in the first place.

• Annotation variability Many experiments depend on some external annotation of the sample, for example, whether the sample comes from a cancer or normal patient. Some of these annotations can be variable, especially for annotations that are not experimentally measured but depend on a clinician's subjective assessment. In such a case, this variability can be reduced by careful planning and documentation of the assessment procedure (the criteria for assigning the patient to a specific class), and by using a variety of independent assessors.

In contrast, intrinsic factors are ones that are potentially biologically meaningful and which we may wish to model explicitly in their own right, once the external factors have been accounted for. Such intrinsic factors include:

• Dynamic range Different genes perform different roles in the cell, and this dictates the range of expression levels they can take. In particular, mRNA from genes coding for transcription factors is known to exhibit a small dynamic range, making it difficult to detect the differential expression signal over the background noise.


• Tissue specificity Gene expression for some genes is highly dependent on the tissue type, whereas so-called "house-keeping" genes are active in most tissues since they are required for the basic functioning of the cell. Therefore, gene expression experiments that assay the wrong tissue may fail to detect differential gene expression that is present. The physical distance between the "right" and "wrong" tissue might be very small, leading to situations where both types of tissue are assayed together and the resulting sample is a heterogeneous mixture of many cell types, potentially reducing our ability to detect subtle changes in specific tissue types.

• Time specificity Gene expression exhibits strong time dependence, on several time scales. For example, genes in certain bacteria respond to a change in the lactose levels in the environment by expressing genes responsible for producing the lactase enzyme. The expression of these genes reaches a steady state over a timescale of minutes to hours. Once the lactose has been metabolised, the genes will stop being transcribed and expression will gradually decrease as the mRNA decays. In contrast, other genes, responsible for embryogenesis, may only be active during that stage of development. On yet another timescale, genes related to the circadian cycle show cyclic patterns of expression over the hours of the day. These differing dynamic patterns raise two issues. First, is the experiment capturing the entire pattern? Most microarray experiments are snapshots in time. Those that are taken across time (time course experiments) tend to be smaller due to the extra effort in measuring more samples. Measuring expression over time is not practical for certain experiments, such as those relying on human biopsies. Second, there is the issue of sampling frequency — is the experiment precise enough to measure the events of interest? If the time-course experiment is spaced at intervals too far apart, it will fail to capture the higher frequency changes, which may be biologically relevant. In practice, except for time course experiments, many gene expression experiments only capture a snapshot in time, when most gene expression has already reached steady state.

• Heterogeneous samples Phenotypes that superficially appear the same may have different underlying causes. For example, breast cancer has been shown to be a heterogeneous disease, driven by different cellular mechanisms and exhibiting phenotypes such as different degrees of aggressiveness and response to treatment (Loi et al., 2007; Perou et al., 2000; Sørlie et al., 2001; Sotiriou et al., 2006). On the one hand, by analysing such sub-populations together, we may increase the statistical power (the probability of detecting a true association) for detecting common genomic mechanisms, at the expense of detecting those mechanisms that are distinct. On the other hand, analysing them separately may result in sample sizes that are so small that they reduce the power to detect even the common causes.
This is a consequence of the well-known statistical bias-variance trade-off (Hastie et al., 2009a).


• Post-translational modifications A post-translational modification is a change to the protein after it was produced by the ribosome. Modifications include events such as addition and removal of amino acids, phosphorylation, methylation, and acetylation, which affect protein activation, localisation (which part of the cell the protein will be transported to), degradation, and ability to interact with other proteins (Mann and Jensen, 2003). This phenomenon again demonstrates that expression levels, measured as mRNA levels, may not directly correspond to protein levels.

• Alternative splicing In alternative splicing, the pre-mRNA exons transcribed from a gene are spliced together in different ways, producing different mRNA molecules and eventually different proteins called isoforms (Blencowe, 2006). Each isoform may function differently in the cell, and the production of each isoform can be tissue dependent. This phenomenon further weakens the assumption that one gene codes for one protein, and that the gene can be assayed by one probeset — generally, each isoform requires its own probeset. When several isoforms from the same gene are measured, they can provide conflicting evidence regarding the expression of the gene, unless accounted for. Far from being a rare phenomenon, alternative splicing is estimated to occur in about half of all human genes (Modrek and Lee, 2002).

• MicroRNA MicroRNAs are a class of small RNA molecules that bind to mRNA and degrade it, thus decreasing gene expression. MicroRNAs are an important part of the cell's protein regulation systems (Baek et al., 2008; Bartel, 2004); currently, there are several hundred known variants, and they occur commonly in the cell. As with alternative splicing, microRNA shows that while a gene can be "active", in the sense that it is being transcribed into mRNA, this does not necessarily translate to protein production. Moreover, gene regulatory interactions are not limited to protein coding genes — our mental model of regulation through transcription factors must be expanded to include microRNAs as well.

• Non-transcriptional regulation One of the most commonly studied mechanisms of gene regulation is transcriptional regulation: the process in which a gene that codes for a transcription factor is expressed, and the factor then modulates the expression of another gene. However, aside from transcriptional regulation there are at least two other major cellular regulatory systems (Alon, 2007), signalling regulation and metabolite regulation. In signalling regulation, also described by protein-protein interaction (PPI) networks, proteins interact with one another, performing tasks such as relaying information to the cell about its environment. Current knowledge of signalling regulation mechanisms is relatively sparse, partly because these networks operate on much shorter timescales than gene expression and because protein levels are harder to assay on a large scale compared with gene expression. Metabolite regulation is the regulation of gene expression by


metabolites inside and outside the cell, as in the bacterial lactose response discussed previously. Since many transcriptomic analyses examine gene expression in isolation from proteomic and metabolomic data, they produce an incomplete picture of the cell's activities. Some recent examples of studies that integrate several data sources, such as gene expression, metabolites, and genetic data, include Chen et al. (2008) and Inouye et al. (2010a).

• Pathways and topology Apart from the distinction between regulatory, signalling, and metabolic pathways, there is the issue of pathway topology, that is, the network structure. Gene networks can have different topologies, depending on their cellular role (Alon, 2007). The observed complexity of gene networks is partly driven by evolutionary selective pressure towards genetic buffering — stability of the phenotype in the face of mutations (Moore, 2005). By not relying on any one gene for its operation, the system is more resilient to potential damage. While buffering confers evolutionary advantages, it makes the analysis of transcriptomic data more difficult, since it is harder to separate the contributions of each gene to the phenotype — a perturbation to the expression of one gene may be insufficient to affect the phenotype. Moreover, some of the marginal contributions may be weak in themselves but constitute part of a larger epistatic mechanism. In Chapter 4 we discuss several approaches for analysing transcriptomic data from the pathway perspective.

Figure 2.3.: A revised Central Dogma of molecular biology (nodes: epigenetics, DNA, mRNA, microRNA, protein, metabolite, clinical phenotype). We distinguish between a clinical phenotype, which is a high level phenotype such as case/control status, and other phenotypes — many of the other nodes can be considered phenotypes in their own right, such as gene expression (mRNA) in eQTL studies.

Having surveyed some of the biological phenomena that indicate that our picture of gene → mRNA → protein is incomplete, we may consider a revised Central Dogma, taking into


account the factors we have discussed: DNA, mRNA, microRNA, proteins, epigenetics (discussed in Section 2.4.6), and metabolites, as shown in Figure 2.3. The issue of what constitutes a phenotype is subjective; we may consider some high level manifestation of disease as the phenotype, or we may consider protein or metabolite concentrations as phenotypes as well. In Section 2.4.8 we discuss expression quantitative trait loci (eQTL) studies, where gene expression itself is considered a phenotype driven by genetic factors, and other external phenotypes are considered to be downstream of the genes.

Despite the limitations of gene expression experiments, and notwithstanding the fact that gene expression is only a partial description of cellular activity, gene expression microarrays have been a mainstay of modern molecular biology for three main reasons. First, gene expression microarrays enable us to assay tens of thousands of probesets simultaneously, and gain insight into the important mechanism of gene expression. Second, their low cost compared with alternative technologies such as RNA-seq (Wang et al., 2009) and proteomic methods has made gene expression arrays attractive to researchers. Third, analysis of gene expression data has become relatively routine, and generally does not require specialised computing resources once the data have been preprocessed, in contrast with sequencing data, which must undergo extensive postprocessing such as alignment and assembly.

2.4. The Genetic Basis of Disease

Genetics is concerned with inheritance of traits (phenotypes) — questions such as which phenotypes are heritable, how heritable they are (the genetic component of the observed variability), what the mechanisms of heritability are, and which mutations are responsible for important traits such as disease. DNA, ignoring for the moment the role of epigenetics, is the major basis for heredity, being passed from parents to child. One important vehicle of heritable disease is the single-nucleotide polymorphism (SNP), which is a population-level variation in one DNA base. (A SNP is usually referred to as a variant rather than a mutation, since we do not know which of the variants is the wild type and which is the mutant.) How common a SNP is in the population is measured by its minor allele frequency (MAF). Typically, only SNPs that are common enough in the population are assayed on microarrays, as SNPs with low MAF require large sample sizes in order to be genotyped confidently. To understand the importance of SNPs in disease, we must first understand some basic genetic facts.

In diploid organisms, such as humans, there are two copies of each chromosome, one from each parent (except for the sex chromosomes). Each chromosome in the pair is said to provide one allele — one of the bases A, G, C, and T. For our purposes here, the actual base does not matter.
Typically, we deal with biallelic variants, therefore one allele is arbitrarily denoted as "A" and the other as "B" (the labels themselves do not imply any ordering of the alleles). Taken together, each individual has at one locus (DNA position) a pair of alleles, out of the three possible combinations — AA (homozygous for A), AB (heterozygous), or BB (homozygous for B).


Figure 2.4.: Different phenotypes are characterised by different combinations of variant frequency and strength of genetic effect (odds ratio). Reprinted by permission from Macmillan Publishers Ltd (Manolio et al., 2009), copyright (2009).

Here, we do not distinguish between AB and BA because these two genotypes are mostly functionally identical — it does not matter which allele came from which parent, only whether the offspring actually has the allele or not. The process of determining the source of each allele (mother or father) is known as phasing.

Genetic traits are roughly divided into two classes: Mendelian traits and multifactorial traits. In Mendelian traits, a single SNP may be sufficient to trigger disease. Mendelian diseases tend to be rare, but severe in their effect.
For example, cystic fibrosis (CF) 2 is an autosomal recessive disease caused by a mutation in the CFTR gene on chromosome 7, with an estimated population prevalence of about 1 in 1972 births (Scotet et al., 2003). CF is Mendelian and recessive, and having two copies of a defective CFTR gene guarantees that the disease will develop. In contrast, many diseases with known genetic components, such as some types of cancer, hypertension, Type 1 and Type 2 diabetes, and celiac disease, are multifactorial, in that they are thought to depend on a relatively large number of SNPs. Multifactorial diseases tend to be more common than Mendelian diseases — for example, the estimated prevalence is 1% for celiac disease, and for hypertension in the USA it is 7.3–66.3% depending on age (Ong et al., 2007), although for T1D the prevalence is only about 0.3%.

Hypotheses on the genetic architecture of multifactorial disease are based on the evolutionary principle that strong SNPs tend to be rare — as with Mendelian disease — whereas weak SNPs tend to be common.

2 http://www.ncbi.nlm.nih.gov/omim/219700


The reason for the difference in frequency is negative selection pressure. The marginal contribution of each SNP to the individual's fitness influences the degree of selection against it. Lethal traits are strongly selected against, especially when they affect an individual before reproductive age, whereas weakly-acting (low penetrance) SNPs that affect the individual later in life incur less negative selection pressure. The first major hypothesis for the architecture of common disease is that there are many weak SNPs contributing to the disease — the common disease common variant (CDCV) hypothesis (Bodmer and Bonilla, 2008; Lohmueller et al., 2003; Pritchard and Cox, 2002; Reich and Lander, 2001). A competing hypothesis — common disease rare variant (CDRV) — assumes that common disease is caused by a small number of rare but strong SNPs. In practice, there is likely to be a continuum of both allelic frequency and the size of the variant's effect (Manolio et al., 2009), as illustrated by Figure 2.4, and both hypotheses probably hold true to different extents in different traits (Schork et al., 2009). Genome-wide association studies (GWAS), which aim to find strong SNP associations with phenotypes by examining hundreds of thousands of known SNPs (see Section 2.4.5), are premised on the CDCV assumption, since they largely examine variants that were a priori known to be common or that could be confidently declared as SNPs from sequencing data. These constraints impose a lower limit on allele frequencies that are considered to be SNPs, usually around 1% MAF.

2.4.1. Linkage Disequilibrium

Linkage disequilibrium (LD) is the phenomenon in which regions of DNA that are physically close to each other tend to be more highly correlated in terms of their genotype than regions that are far apart. LD can be explained by genetic recombination. Recombination is the result of meiosis, the process in which the DNA in a diploid organism is split between its haploid gametes (sperm for males and eggs for females), and later recombined during fertilisation and the creation of the diploid offspring. Since, to a first approximation, recombination occurs uniformly across the DNA 3, loci that are close to each other have a higher probability of being inherited together than loci that are further apart. LD thus manifests itself through blocks of highly correlated SNPs, as shown in Figure 2.5. A set of SNPs commonly inherited together on the same chromosome is called a haplotype. LD has implications both for analysis of SNP data and for biological interpretation; see Chapter 3 for discussion.

LD is estimated from the data, using several alternative approaches. Assume we have two SNPs, with two alleles each, 'A'/'a' and 'B'/'b', respectively.
The joint probabilities of observing the combinations of these two alleles are

    p_AB := P(A, B),   p_Ab := P(A, b),   p_aB := P(a, B),   p_ab := P(a, b).

3 Recombination is now known to occur in hotspots and not uniformly across the DNA (Myers et al., 2005).


Figure 2.5.: Association for SNPs in chromosome 13 (q22.1) with pancreatic cancer. The diamonds and squares represent the log10 p-values for association of SNPs. Overlaid are the recombination rates (centimorgans per megabase). On the bottom is a heatmap showing the LD between the SNPs, measured by r². Reprinted by permission from Macmillan Publishers Ltd (Petersen et al., 2010), copyright (2010).

The two SNPs are said to be in linkage equilibrium when they are independent: the joint probabilities p_AB, ..., p_ab factor as products of the marginal probabilities, namely p_AB = p_A p_B, p_Ab = p_A p_b, p_aB = p_a p_B, and p_ab = p_a p_b, where p_A, p_a, p_B, and p_b are the probabilities of observing allele 'A' at the first SNP, allele 'a' at the first SNP, allele 'B' at the second SNP, and allele 'b' at the second SNP, respectively.

Since these probabilities are never directly observed, but are instead estimated from allele frequencies in finite data, there may be some random fluctuations from perfect equilibrium.
One measure of this deviation is D, defined as

    D = p_AB − p_A p_B,    (2.1)

where p_A and p_B are estimated from the observed allele frequencies. Estimating p_AB is


less straightforward, since this is the probability of the AB haplotype, which is not generally known in population studies as it depends on the phase: the haplotypes AB/ab (one on each chromosome of the chromosome pair) cannot be distinguished from the haplotypes aB/Ab without knowing which allele came from the father and which came from the mother. Therefore, estimating p_AB requires either phasing the genotypes or more sophisticated estimation approaches such as Expectation-Maximisation (Foulkes, 2009). A value of D = 0 represents perfect equilibrium, and non-zero values represent deviations from equilibrium.

One drawback of the D measure is that its range depends on the allele frequencies, so it cannot be meaningfully compared across SNPs with different frequencies. Another statistic is thus D′, defined as

    D′ = |D| / D_max,    (2.2)

where

    D_max = min{p_A p_b, p_a p_B}   if D > 0,
    D_max = min{p_A p_B, p_a p_b}   if D < 0.

The D′ statistic is scaled such that 0 ≤ D′ ≤ 1, where a value of 0 represents equilibrium and a value of 1 is called complete LD.

Another related measure of LD is r², which is equivalent to the squared Pearson correlation between the genotypes, and can be expressed as

    r² = D² / (p_A p_B p_a p_b),    (2.3)

where 0 ≤ r² ≤ 1. A value of r² = 1 is called perfect LD, indicating that the two SNPs have the same genotypes in the samples tested.

2.4.2. Hardy-Weinberg Equilibrium

Briefly, the Hardy-Weinberg principle states that in a randomly-mating population (where every individual has the same probability of mating with any other individual), free of selection, mutation, and migration, the genotype frequencies at a given biallelic locus follow a binomial distribution that is a function of the allele frequencies (Falconer and Mackay, 1996; Hedrick, 2009), and the allele frequencies will be fixed between generations. In other words, given two parental alleles A and B with respective frequencies p and q (where p + q = 1), the offspring genotype frequencies for the three possible genotypes are p² (homozygote for A), 2pq (heterozygote), and q² (homozygote for B). The locus is then said to be in Hardy-Weinberg equilibrium (HWE). Note that HWE applies to each locus separately, as different loci can be under different selection pressures and different mutation rates.

The Hardy-Weinberg principle is useful in case/control genome-wide association studies (GWAS, see Section 2.4.5), as significant deviations of the genotype frequencies from HWE (in the controls) may indicate genotyping errors.
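To make these quantities concrete, the following minimal sketch (in Python; the haplotype and allele frequencies are invented purely for illustration and not taken from any study) computes D, D′, and r² for a pair of biallelic SNPs, and the expected genotype frequencies at a single locus under HWE.

    # Illustrative haplotype frequencies; in practice these would be
    # estimated from phased genotypes or via EM, not specified by hand.
    p_AB, p_Ab, p_aB, p_ab = 0.50, 0.10, 0.15, 0.25   # must sum to 1

    p_A = p_AB + p_Ab              # marginal frequency of allele A at SNP 1
    p_B = p_AB + p_aB              # marginal frequency of allele B at SNP 2
    p_a, p_b = 1 - p_A, 1 - p_B

    D = p_AB - p_A * p_B                                         # Eqn (2.1)
    D_max = min(p_A * p_b, p_a * p_B) if D > 0 else min(p_A * p_B, p_a * p_b)
    D_prime = abs(D) / D_max                                     # Eqn (2.2)
    r2 = D ** 2 / (p_A * p_B * p_a * p_b)                        # Eqn (2.3)
    print(D, D_prime, r2)

    # Expected genotype frequencies under HWE at one locus, given allele
    # frequencies p (allele A) and q = 1 - p (allele B).
    p, q = 0.7, 0.3
    print(p ** 2, 2 * p * q, q ** 2)    # expected AA, AB, BB frequencies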


2.4.3. SNP Microarray Technology

SNP microarrays are similar to gene expression microarrays in that they measure thousands of probes simultaneously. The main difference, however, is that they are not used for measuring expression through binding of mRNA, but rather for measuring the binding intensity of DNA itself, for each allele of each SNP. Two of the major manufacturers of commercial SNP arrays are Affymetrix (Santa Clara, CA, USA) and Illumina (San Diego, CA, USA).

Modern SNP arrays measure between 500,000 and over 1 million SNPs at a time. The SNPs represented on the arrays were chosen to represent (tag) most of the variants identified by the HapMap 4 project (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2005, 2007), which has sought to characterise the genetic variability in several human populations. As of HapMap3, there are more than 3 million known genetic variants, and 1000Genomes (1000 Genomes Project Consortium, 2010) 5 has around 38 million SNPs as of early 2012; currently, they cannot all be measured on a SNP array. However, this is not a major impediment due to LD — unobserved SNPs exhibiting high LD with observed SNPs can be imputed based on the observed ones, thus increasing the effective number of SNPs on the array. Depending on the quality of imputation, the imputed SNPs can be tested for association with the phenotype as if they were measured on the array in the first place (see Section 2.4.4).

The basic steps in a SNP array experiment are (LaFramboise, 2009):

• DNA extraction If we are after germline (non-somatic) variations, then many body tissues are suitable since their DNA is identical. Blood samples or cheek swabs are common sources of cells from which DNA can be extracted.

• Hybridisation and washing DNA is hybridised to the array, which contains short probes designed to be complementary to DNA fragments containing known SNPs. Each allele is measured by a separate probe. As with gene expression arrays, earlier Affymetrix SNP arrays had perfect match (PM) and mismatch (MM) probes, intended to measure non-specific binding for later correction, whereas more recent Affymetrix arrays, such as the Human SNP Array 6.0, use PM probes only. For Illumina arrays, there is only one probe for each allele of each SNP. After hybridisation, the remaining unbound material is washed away.

• Measurement As with gene expression microarrays, SNP arrays are laser scanned to produce a digital representation of the signal intensities.

4 http://www.hapmap.org
5 http://www.1000genomes.org


• Preprocessing Each allele may have multiple probes, and their individual binding intensities are statistically summarised to form the signal intensity for the allele. From the ratios of the signal at the two alleles at each locus, and the identity of the probes (matching SNPs with A, T, G, or C), the discrete genotypes (AA, AB, BB) are inferred. Genotype calling, as this process is called, is performed for Affymetrix arrays by tools such as RLMM (Rabbee and Speed, 2006) and the subsequent BRLMM (Affymetrix, Inc., 2006), CHIAMO (Marchini et al., 2007), and Birdseed (Korn et al., 2008), and for Illumina arrays by methods such as Illuminus (Teo et al., 2007). The basic principle behind genotype calling, shown in Figure 2.6, is clustering of the samples into three groups: homozygous for A, heterozygous (AB), and homozygous for B. The genotype calling methods differ in their statistical assumptions, such as the distribution of the samples in each cluster. In the process of genotype calling, SNPs with ambiguous genotypes are discarded. Typically, SNPs are also filtered based on MAF ≥ 1% (since there is often not enough statistical power to confidently detect rarer variants) and a statistical test for Hardy-Weinberg equilibrium, which tests for the statistical non-independence of the two alleles, a useful indicator of genotyping errors.

• Analysis Once the data have been preprocessed and the genotypes have been called, analysis of the data can begin. Below, we discuss genome-wide association studies, and in Chapter 3 we discuss some of the statistical and computational principles behind the analysis of genetic data.

2.4.4. Imputation

Roughly speaking, imputation refers to the process of "filling in" unobserved genotypes in a dataset, leveraging haplotype patterns in known data to infer the genotypes of the unknown SNPs in our dataset of interest. These genotypes may be unobserved since they were not assayed in the first place, or they may contain missing values representing the situation where the genotype calls could not be made with sufficient confidence (low posterior probability).

The haplotype information is available in the form of a reference panel, such as those available from HapMap (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2005, 2007) and the 1000Genomes project (1000 Genomes Project Consortium, 2010). These panels are densely genotyped, often containing several million SNPs, and their phase is known (hence the haplotypes are known). Apart from considering the correlations, there are also advantages in considering distance and recombination rates, so as to discount the effect of SNPs that are further away from the SNP being imputed (Marchini et al., 2007).

Once the missing genotypes have been imputed, and the quality of the imputation has been verified, they can be treated like any assayed SNP and tested for association with phenotypes. This is especially useful since most current genome-wide studies assay far fewer SNPs than are available in the reference panel, but the unassayed SNPs can be imputed and assessed as well.
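Widely used imputation methods are typically based on hidden Markov models over reference haplotypes, which is well beyond the scope of a short example. The deliberately simplified sketch below (Python with NumPy; the genotype vectors are invented, and the "most common combination" rule is only a stand-in for proper haplotype modelling) illustrates the core intuition of borrowing information from a correlated, observed SNP to fill in missing genotype calls.

    import numpy as np

    # Rows are samples; columns are two nearby SNPs coded as minor-allele
    # dosage {0, 1, 2}; -1 marks a missing genotype call. Invented data.
    G = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [2, 2],
                  [2, 2], [1, 1], [0, 0], [2, -1], [1, -1]])

    tag, target = G[:, 0], G[:, 1]
    complete = target != -1

    # For each tag genotype, record the most frequent target genotype among
    # samples with complete data (a crude proxy for using haplotype
    # frequencies from a phased reference panel).
    lookup = {}
    for g in (0, 1, 2):
        obs = target[complete & (tag == g)]
        if obs.size:
            lookup[g] = int(np.bincount(obs).argmax())

    imputed = target.copy()
    overall_mode = int(np.bincount(target[complete]).argmax())
    for i in np.where(~complete)[0]:
        imputed[i] = lookup.get(tag[i], overall_mode)

    print(imputed)    # missing entries filled in from the tag SNP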


Figure 2.6.: Measured intensities for the two alleles of one locus, over all samples in the 1958 birth cohort of the WTCCC data (The Wellcome Trust Case Control Consortium, 2007). The samples coloured red, green, and blue are called as BB, AB, and AA, respectively. The light blue colour represents missing calls (CHIAMO calls made with posterior probability < 0.9). The left and right panels show the genotype calls before and after imputing the missing calls, respectively. Reprinted by permission from Macmillan Publishers Ltd (Marchini et al., 2007), copyright (2007).

Examples of widely used imputation tools include IMPUTE2 (Howie et al., 2009, 2011), MACH (Li et al., 2010), and BEAGLE (Browning and Browning, 2007).

2.4.5. Genome-Wide Association Studies
Genome-wide association studies (GWAS) seek to find SNPs that are statistically significantly associated with phenotype, with the aim of detecting the genetic basis for various observed traits such as disease. Prior to GWAS, the most common type of study in genetics was the linkage study, where heritable traits were studied along family trees. In contrast, GWAS is usually based on large random samples from unrelated individuals. The traits under consideration in GWAS are either continuous, such as height or weight, or binary, such as case/control disease status. The basic GWAS analysis is to test each SNP individually for association with phenotype, and to discard SNPs that do not pass a stringent p-value cutoff (see Chapter 3 for details). The remaining SNPs are typically tested again for association in an independent validation dataset, and if they are still highly significant then it is taken as strong evidence for true association.
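As a concrete illustration of this single-SNP testing loop, the sketch below (Python with NumPy and SciPy; the genotypes and phenotype are randomly generated, so no real association is expected, and the sizes are toy values) applies a chi-square test to the 2 × 3 table of case/control status against genotype class for each SNP in turn. Real GWAS pipelines typically use trend tests or logistic regression with covariates, but the overall structure — one marginal test per SNP followed by a p-value threshold — is the same.

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)
    n, p = 2000, 100                        # samples, SNPs (toy sizes)
    X = rng.integers(0, 3, size=(n, p))     # genotypes as minor-allele dosage {0,1,2}
    y = rng.integers(0, 2, size=n)          # 0 = control, 1 = case

    pvals = np.empty(p)
    for j in range(p):
        # 2 x 3 contingency table: phenotype status by genotype class
        table = np.array([[np.sum((y == s) & (X[:, j] == g)) for g in (0, 1, 2)]
                          for s in (0, 1)])
        pvals[j] = chi2_contingency(table)[1]

    print(np.where(pvals < 1e-5)[0])        # SNPs passing an (arbitrary) cutoff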


Figure 2.7.: The family-wise Type 1 error rate for k independent tests. The per-test threshold is α = 0.05.


In the genome-wide context, statistical significance must account for the problem of multiple testing: the probability of a false positive (type 1 error) increases rapidly with the number of tests. Assuming k independent tests, each at a nominal threshold of α, the probability of at least one test being a false positive is

    P(at least one false positive) = 1 − (1 − α)^k,

as shown in Figure 2.7. This potentially leads to spurious associations being found when large numbers of SNPs are tested at an otherwise nominally significant threshold such as P ≤ 0.05. Corrections for the multiple testing issue include variants of family-wise error rate corrections and false discovery rate (FDR) corrections (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003). The simplest is the Bonferroni correction, which controls the family-wise error rate, that is, the probability of at least one error over all tests, by setting the corrected significance threshold for k tests to α/k, for a given nominal significance level α. In genome-wide studies, genome-wide significance is usually taken to be P < 5 × 10⁻⁸, which is equivalent to a Bonferroni correction for one million tests with a per-test α = 0.05; significance must be more stringent when more SNPs are tested.
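The short calculation below (Python; the values of k are arbitrary) reproduces the family-wise error rate formula above and the corresponding Bonferroni-corrected per-test threshold, showing why the per-test significance level must tighten as the number of tested SNPs grows.

    alpha = 0.05
    for k in (1, 10, 1_000, 1_000_000):
        fwer = 1 - (1 - alpha) ** k       # P(at least one false positive)
        bonferroni = alpha / k            # corrected per-test threshold
        print(k, round(fwer, 4), bonferroni)

    # For one million tests the Bonferroni threshold is 5e-08, the value
    # conventionally taken as genome-wide significance.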


The completion of the human genome project (The International Human Genome Mapping Consortium, 2001; Venter et al., 2001) and improvements in sequencing technologies have led to projects such as HapMap (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2005, 2007) and 1000Genomes (1000 Genomes Project Consortium, 2010), characterising DNA variation across human populations. Together with better identification of variants, the development of SNP arrays has allowed GWAS to become a common method for analysing genetic associations with phenotype — as of early 2011, the National Human Genome Research Institute Catalog of Published Genome-Wide Association Studies (NHGRI GWAS) (Hindorff et al., 2009) 6 lists more than 4000 significant SNP-phenotype associations curated from almost 750 studies, covering almost 450 phenotypes. We discuss the methods used in GWAS in Chapter 3.

6 http://www.genome.gov/gwastudies/

2.4.6. Challenges in Analysis of SNP Microarrays

As with gene expression experiments, there are challenges and obstacles with SNP array experiments and GWAS. The issues of measurement noise, sample preparation and handling, batch effects, and non-specific binding are similar to those in gene expression arrays. Issues specific to SNP arrays and GWAS in general include:

• Population structure SNPs are inherited, and may vary greatly across different human populations. The inclusion of different populations in one study, without appropriate correction, may lead to spurious associations (false positives). To mitigate these effects, GWAS are typically limited to one homogeneous population, or statistical methods such as Principal Component Analysis are used to correct for population structure prior to analysis (Price et al., 2006).

• Non-random sampling GWAS assumes random samples from the population. Including related samples, such as samples from siblings, can potentially confound the association between genotype and phenotype, since siblings are more likely to experience common environmental factors that affect the phenotype. Unless accounted for in the analysis, duplicated or potentially related samples should be detected and removed beforehand.

• Ambiguous calls Unlike gene expression, which is inherently continuous, the SNP array signal is a continuous representation of an underlying discrete allele, and the allele is determined through genotype calling. The genotype calling process is statistical and depends on certain assumptions, which differ between array platforms. Therefore, it is common to discard genotype calls below a confidence of 0.9 (posterior probability) and mark them as "no calls". Samples with many such low-confidence calls should be discarded; however, when the proportion of missing calls is not too large it may be possible to statistically impute them from the high-confidence calls.

• Sample contamination The DNA samples collected from individuals may be contaminated, either with DNA from other individuals, or with DNA from other organisms such as bacteria (Bahlo et al., 2010). The former is less of a concern in clinical studies, but is more of an issue for samples taken from residual tissue such as forensic samples.

• Sample mix-up Human error in collecting and handling the samples may create situations where some samples are mis-labelled, and the DNA does not match the correct person, potentially flipping the phenotype status in case/control studies.

• The sex chromosomes Humans have 22 pairs of autosomal chromosomes and one pair of sex chromosomes, where males have XY and females XX. The female X chromosome undergoes a process called X-inactivation, whereby only one X chromosome of the pair is active in each cell, and the choice of which chromosome gets silenced is random. Thus a risk allele on a female X chromosome contributes less risk than an equivalent allele on a male X chromosome. Therefore, the sex chromosomes are typically excluded from GWAS, unless specialised approaches are used (Clayton, 2009b).

• Epigenetics Epigenetics refers to heritable mechanisms that are not caused by changes to the DNA sequence itself. Such mechanisms include methylation of the DNA (Eckhardt et al., 2006) and histone modifications (Feinberg, 2007; Goldberg et al., 2007), which modulate the ability of genes to be transcribed and thus affect gene expression in the cell. Epigenetic marks are not detected by SNP arrays and can potentially confound GWAS results.

• Computational challenges Gene expression experiments are limited to tens of thousands of probes and typically include on the order of hundreds to a few thousand samples.


Therefore, most analyses of transcriptomic data have been computationally feasible, even on commodity computing hardware. In contrast, SNP data is much larger — datasets commonly include upwards of 500,000 SNPs, and usually several thousand samples. Meta-analyses, in which the results of several studies are combined to increase statistical power, are larger yet — Zeggini et al. (2008) analysed Type 2 diabetes in 2.2 million genotyped and imputed SNPs over more than 10,000 samples, collected from three studies. The size of the data means that standard approaches for analysing the data, such as fitting of statistical models, are not practical from either a time or space complexity perspective. We discuss this issue further in Chapter 3; for now, we mention two main characteristics of genetic data that make it amenable to analysis even at large data sizes. The first is the assumption of sparsity, which means that we expect the vast majority of SNPs not to be causal drivers of the phenotype. By taking advantage of statistical models that are based on this assumption, we can efficiently fit sparse statistical models to large data. The second characteristic is the discreteness of the genotypes, which are typically represented in terms of dosage of the minor allele {0, 1, 2} (see Section 3.4.1 for discussion of genotype coding). The discrete nature of the data allows us to compress or otherwise encode the data so as to reduce computational space requirements, and to accelerate otherwise costly numerical computation; a small sketch of one such encoding follows this list.
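As an illustration of the kind of encoding this discreteness permits, the sketch below (Python with NumPy; the packing scheme and the example genotype vector are purely illustrative and not a description of any particular package) stores four {0, 1, 2} dosage values per byte using two bits each, roughly a four-fold saving over one byte per genotype.

    import numpy as np

    def pack_genotypes(g):
        # Pack dosage values {0,1,2} into 2 bits each, 4 genotypes per byte.
        g = np.asarray(g, dtype=np.uint8)
        n_pad = ((len(g) + 3) // 4) * 4
        padded = np.zeros(n_pad, dtype=np.uint8)
        padded[:len(g)] = g
        quads = padded.reshape(-1, 4)
        shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
        return (quads << shifts).sum(axis=1).astype(np.uint8)

    def unpack_genotypes(packed, n):
        # Recover the first n genotypes from the packed representation.
        quads = (packed[:, None] >> np.array([0, 2, 4, 6])) & 0b11
        return quads.reshape(-1)[:n]

    g = np.array([0, 1, 2, 2, 1, 0, 2])
    packed = pack_genotypes(g)
    assert np.array_equal(unpack_genotypes(packed, len(g)), g)
    print(len(g), "genotypes stored in", packed.nbytes, "bytes")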
2.4.7. The Problem of Missing Heritability

Many GWAS have found SNP-phenotype associations that are highly statistically significant, across diverse phenotypes such as various diseases, metabolite levels, height, and response to medication 7. Yet even the most significant SNPs fail to explain a large proportion of the observed variability in phenotype (Manolio et al., 2009), despite the fact that many of the phenotypes are estimated to be highly heritable, that is, the phenotype has a strong genetic component. This problem has come to be known as the problem of missing heritability. For example, a large meta-analysis of 16 studies covering more than 38,000 samples (Lindgren et al., 2009) investigated SNP associations with central adiposity and fat distribution in humans. They identified three SNPs (rs987237 with p = 4.5 × 10⁻⁹, rs7826222 with p = 1.2 × 10⁻⁸, and rs6429082 with p = 2.6 × 10⁻⁸) as highly significantly associated with these traits. Despite being highly statistically significant, the two most significant SNPs explain only a small proportion of phenotypic variance (0.05% and 0.04% for rs987237 and rs7826222, respectively), and only confer a very small effect size in absolute terms — 0.49 cm and 0.43 cm in waist circumference, respectively.

7 See http://www.genome.gov/gwastudies for a curated list of GWAS results.

Several hypotheses have been suggested as to why many GWAS fail to account for much of the observed phenotypic variability (Eichler et al., 2010). These include the issue of sample


heterogeneity, discussed in Section 2.3.2, epigenetics as discussed earlier, and the following additional reasons:

• Epistasis In a strict statistical sense, epistasis means a non-additive statistical model, that is, one where the joint effect of two SNPs on the phenotype is different from the sum of the two marginal effects of the SNPs (Clayton, 2009a; Cordell, 2009; Moore and Williams, 2009). Since most GWAS employ univariable screening of SNPs, they ignore potential interactions and associations between SNPs. It may be that some sets of SNPs have strong joint effects on the phenotype without exhibiting detectable marginal effects — these would be missed by a univariable approach, as it only considers each SNP individually.

• Rare variants Rare variants are alleles that occur with MAF below roughly 0.5%–5%, and are potentially not measured by current SNP arrays. Besides being included on the array, or at least being in strong LD with an assayed SNP, assaying rare variants requires larger sample sizes, since the rarer they are, the smaller the chance of capturing enough of them in the sample in the first place.

• Weak variants Weak variants are SNPs that contribute very little to overall disease risk, but are not necessarily rare (Manolio et al., 2009). Because they have small effect sizes, combined with hybridisation and measurement noise, weak variants may not appear statistically significant in GWAS (low signal-to-noise ratio), especially after correcting for multiple testing. Despite being individually weak, the total contribution of many weak variants may still be a strong determinant of disease risk. As with rare variants, larger studies are required to adequately filter the noise and variability and find the weak variants.

• Copy number variation Copy number variation (CNV) refers to segments of DNA that undergo duplication (Freeman et al., 2006; McCarroll and Altshuler, 2007), thus potentially increasing the cellular expression levels of the duplicated genes. Some CNVs are known to occur in cancer (somatic CNVs), whereas others are inherited. In practice, the contribution of CNVs to heritable disease is unclear — a large study of SNPs and CNVs across eight common multifactorial diseases in more than 16,000 samples recently concluded that "CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases" (Wellcome Trust Case Control Consortium, 2010). Additionally, they found that many of the CNVs were already indirectly identified by the SNP analysis.

• Overestimated heritability Heritability itself is estimated from pedigree data, such as twin studies, since these allow other potentially confounding factors, such as shared environment, to be controlled for. However, heritability may be estimated


with low precision in some cases, leading to apparent missing heritability when the estimates are high. For example, Nisticó et al. (2006) estimated the heritability of celiac disease in Italian twins as 0.57–0.87, depending on the assumed population prevalence. Heritability may even change over time and between environments (Visscher et al., 2008).

2.4.8. Expression Quantitative-Trait Loci

Genetic factors are known to be important contributors to the observed variability in gene expression (Cookson et al., 2009; Gilad et al., 2008; Jansen and Nap, 2001), via the mechanism of expression quantitative trait loci (eQTL). SNPs having significant associations with gene expression are candidates for being eQTLs. We distinguish between cis-QTLs, which are SNPs proximal to a gene that induce variation in that gene by affecting its regulatory regions (modulating the ability of transcription factors to bind), and trans-QTLs. There are different definitions of what constitutes proximity of a SNP to a gene; as a heuristic, to be considered a cis-QTL, a SNP must reside on the same chromosome as the gene, typically within 1 Mb of the gene boundaries or the transcription start site. QTLs that do not fit this definition are considered trans-QTLs. The important distinction is that cis-QTLs operate directly on the proximal gene, whereas trans-QTLs are potentially mediated by secondary transcription factors or other downstream effects. Unlike standard case/control phenotypes, which represent one phenotype, gene expression microarrays allow us to measure tens of thousands of phenotypes simultaneously.

Together with SNP data, we can search for strong associations between SNPs and gene expression, with the aim of understanding the underlying regulatory architecture of the genes — how many SNPs regulate them, how they interact (additively, dominantly, epistatically), and which genes these SNPs represent (Rockman, 2008). eQTL analyses can also be expanded to include phenotypes downstream of gene expression, such as disease state, in which case gene expression is considered an intermediate phenotype, mediating between SNPs and disease. Understanding the genetic architecture of disease can provide further insight into the driving molecular mechanisms that would not have been available from examining either gene expression or SNPs in isolation (Barrett et al., 2008; Mackay et al., 2009).

All of the issues and challenges that we have discussed previously in relation to gene expression and SNP experiments apply to eQTL studies as well. Additionally, there is the potential problem of gene expression probes overlapping SNPs, thereby introducing spurious correlations between gene expression and the SNP; however, this was not found to be a major source of bias in two studies (Doss et al., 2005; Emilsson et al., 2008).

Recent examples of large-scale eQTL studies include Göring et al. (2007), who analysed eQTLs for over 20,000 probes, and identified the gene VNN1 as strongly associated with high-density lipoprotein cholesterol (HDL-C) levels.


Later, Emilsson et al. (2008) studied human obesity and related traits such as body-mass index (BMI), using gene expression and SNP data across adipose tissue and blood; they report detecting almost 3400 gene expression traits with significant genetic associations in adipose tissue, after adjusting for factors such as sex, age, cell counts, and BMI. Veyrieras et al. (2008) mapped eQTLs in HapMap lymphoblastoid cells, estimating that most eQTL SNPs are indeed cis-QTLs, residing within or near genes, and that SNPs residing in exons are twice as likely to be eQTLs as those residing in introns.

2.5. Summary

We have surveyed some of the fundamental types of molecular biological data: gene expression and SNPs. Gene expression experiments describe aspects of intra-cellular processes, through measuring mRNA concentrations. In contrast, SNP experiments describe genetic variation that is heritable and largely fixed throughout the life of the individual. Both have provided immense insight into the cellular mechanisms underlying health and disease; however, when employing either approach their limitations must be considered as well. First, there are sources of noise and variation that can confound the results of the analysis. Second, both gene expression and SNP experiments only provide one aspect of the underlying biology. The true mechanisms are likely to involve complex interactions of different effects, such as gene expression, SNP regulation, epigenetic factors, and environmental factors. Integrative analyses that combine these data types are being pursued, with the aim of gaining a more holistic view of the molecular processes of the cell.


3. Review of the Analysis of Gene Expression and Genetic Data

3.1. Introduction

Two major efforts in the analysis of gene expression and genetic data have been the detection of predictive and causal markers of phenotype, and the prediction of phenotype based on these markers. The implicit assumption is that genes and SNPs that are highly associated with the phenotype are potentially causal 1, and if not causal then at least predictive of the phenotype. Identification of predictive markers also provides insight into the underlying biological mechanisms, and is useful for diagnostic and prognostic purposes, such as predicting metastasis in breast cancer patients (van 't Veer et al., 2002). Traditionally, gene expression data and genetic data have been analysed separately and with different methods, due to the different characteristics of the data: the genotype of an individual is a discrete entity and is largely fixed (barring somatic mutations in cancer), whereas gene expression can be highly dynamic on time scales as short as minutes. Nonetheless, some general principles of statistical inference are applicable to both. In this chapter we survey some of the major approaches in marker selection and predictive modelling, contrasting the differences between genetic and gene expression data where appropriate.

1 We distinguish between association and causality, where the former refers to a statistical relationship between two phenomena, such as a gene being highly expressed in some disease but perhaps not being the cause of the disease, and the latter refers to a gene without which disease would not occur (in the simplest setting). See Pearl (2009) for an in-depth discussion of causality.


3.2. Supervised Machine Learning

Two of the main paradigms of statistical machine learning are supervised learning and unsupervised learning. In supervised learning, we are given training data consisting of examples of inputs and outputs (also called predictors and responses, respectively), from which we seek to learn input-output relationships within the data; these relationships may be interesting in their own right, or they may be used for predicting the output given a new set of inputs. The second paradigm is unsupervised learning, where the data is not labelled (there are inputs only) and the goal is to find interesting groups or trends in the data. This thesis deals mainly with supervised methods, therefore we will not cover unsupervised methods here.

Formally, in the supervised learning setting, we are given some inputs x ∈ X and some outputs y ∈ Y, where X and Y are the spaces of all inputs and outputs, respectively, and our goal is to find some function f that maps between X and Y "well", quantified using a loss function L(ŷ, y) that maps from our predicted output ŷ and the true output y to a real positive number (the loss):

    L : Y × Y ↦ R_+.    (3.1)

Since the loss is zero only for perfect mappings ŷ = y and positive otherwise, it is conventional to minimise the loss 2; several concrete examples of loss functions will be discussed later. In order to minimise the loss, we must find the function f (also called a model), out of a set of functions H, that minimises the risk R jointly over X and Y:

    inf_{f ∈ H} R(f) = ∫_{X × Y} L(y, f(x)) dP(x, y),    (3.2)

where P is the joint distribution of X and Y. The minimum attainable risk is also called the Bayes risk, since it is equivalent to the risk from Bayes' rule when the true distribution P is known (see below for discussion of Bayes' rule).

In practice, P is usually not known or is only known approximately, hence the true risk of any given model cannot be known. Therefore, we estimate the risk on our training set consisting of N samples:

    R̂(f) = (1/N) Σ_{i=1}^{N} L(y_i, f(x_i)).    (3.3)

Moreover, usually our function f is parameterised by some weights (also called parameters or coefficients). For example, a linear model of one input x can be parameterised by β:

    y_i = x_i^T β + ε_i,    i = 1, ..., N,    (3.4)

where y_i ∈ R and x_i ∈ R^p are the ith output and inputs, respectively, β ∈ R^p is the vector of model weights (parameters), and ε_i ~ N(0, σ²) is iid Gaussian noise.

2 When the loss function is based on a probabilistic model, that is, the loss function is a probability density function or probability mass function, it is also called the (log) likelihood function, and we either maximise the likelihood or minimise the negative likelihood.
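To make Eqns (3.3) and (3.4) concrete, the small sketch below (Python with NumPy; the choice of β, noise level, and sample size are arbitrary simulation settings, and squared-error loss L(y, ŷ) = (y − ŷ)² is used) simulates data from the linear model, estimates β by minimising the empirical risk over the training samples, and reports the resulting training risk.

    import numpy as np

    rng = np.random.default_rng(1)
    N, p = 200, 5
    X = rng.normal(size=(N, p))
    beta_true = np.array([2.0, 0.0, -1.0, 0.5, 0.0])     # arbitrary "true" weights
    y = X @ beta_true + rng.normal(scale=0.5, size=N)    # data from Eqn (3.4)

    # With squared-error loss, minimising the empirical risk is ordinary
    # least squares, which has a standard closed-form / numerical solution.
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

    def empirical_risk(beta):
        # Eqn (3.3) with L(y, f(x)) = (y - x^T beta)^2
        return np.mean((y - X @ beta) ** 2)

    print(beta_hat.round(2), round(empirical_risk(beta_hat), 3))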


Figure 3.1.: An illustration of the relationship between true and empirical risk as model complexity increases. The Bayes risk is shown as constant since it assumes a fixed model complexity, which is the "correct" (but unknown) model. Empirical risk is the risk observed for a given model in a given finite dataset. On the far left-hand side, the model can be said to be underfitting, as the empirical risk is higher than the true risk. On the right-hand side, the model is overfitting, as it has lower empirical risk than the true risk.

For a given model, we then find the weights that minimise the risk in the training set. This principle is called empirical risk minimisation (ERM), and is equivalent to maximum likelihood estimation in statistics. Since sample sizes are finite, it is possible to achieve small empirical risk R̂ while still having large true risk R. This phenomenon is known as overfitting. Generally, overfitting increases as the set of functions H becomes richer (that is, as model complexity increases) and as the sample size N decreases. Intuitively, overfitting means that the function we have chosen, f′, is more complex than the true function f* and is fitting the noise (random component) in the data rather than just the signal (systematic component). Since the noise is random, overfitting will manifest itself as worse (higher) loss over new inputs than over the original data used for fitting the model.
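The behaviour sketched in Figure 3.1 is easy to reproduce numerically. In the sketch below (Python with NumPy; the data-generating function, noise level, and sample sizes are invented for illustration), polynomial models of increasing degree are fit to a small training set by least squares; the training error keeps falling as the degree grows, while the error on an independently generated test set typically starts to rise, the signature of overfitting.

    import numpy as np

    rng = np.random.default_rng(2)

    def simulate(n):
        x = rng.uniform(-1, 1, size=n)
        y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)   # toy "truth" plus noise
        return x, y

    x_train, y_train = simulate(30)
    x_test, y_test = simulate(1000)

    for degree in (1, 3, 9, 12):
        coefs = np.polyfit(x_train, y_train, degree)        # empirical risk minimisation
        train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
        test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
        print(degree, round(train_mse, 3), round(test_mse, 3))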


to restrict model complexity is to use simple models, for example only allowing linear models. However, a more flexible approach is to use regularisation (also known as penalisation), which reduces model complexity and thus limits the ability of the models to overfit. One such penalisation method is the lasso penalty, discussed in Section 3.4.3. Conversely, if model complexity is too low, that is, the function space H is heavily restricted, then we can only learn very simple associations from the data, and potentially miss out on more interesting ones that could lead to lower risk — this is known as underfitting. Another way to mitigate overfitting is to estimate the empirical risk R̂ on an independent test set, different from the one used for training, leading to the concepts of cross-validation and bootstrapping, discussed later. This relationship between empirical risk, Bayes risk, and true risk, as a function of model complexity, is illustrated in Figure 3.1.

3.3. Linear Models and Loss Functions

All the models employed in this thesis are based on linear models (Eqn 3.4), or transformations of linear models such as log-linear models. The problem of fitting linear and log-linear models to data can be cast as minimising a convex loss function. Two common loss functions are the squared loss for linear regression and the logistic loss for logistic regression, in which case the loss function corresponds to the negative log of the binomial likelihood. The squared loss function over N samples in p variables x_1, ..., x_N is

L(β₀, β) = (1/2) Σ_{i=1}^{N} (y_i − β₀ − x_i^T β)²,  (3.5)

where y_i ∈ R is the ith output, x_i ∈ R^p is the p-vector of inputs for the ith sample, β₀ is the intercept, and β ∈ R^p is a p-vector of model coefficients. Similarly, the logistic loss for binary outcomes y_i ∈ {0, 1} is [3]

L(β₀, β) = Σ_{i=1}^{N} [log(1 + exp(β₀ + x_i^T β)) − y_i(β₀ + x_i^T β)].  (3.6)

Another loss function useful in classification is the squared-hinge loss, which is equivalent to a least-squares support vector machine (SVM) with a linear kernel (Chang et al., 2008),

L(β₀, β) = (1/2) Σ_{i=1}^{N} max{0, 1 − y_i(β₀ + x_i^T β)}²,  y_i ∈ {−1, +1}.  (3.7)

[3] An equivalent formulation of the logistic regression loss is L(β₀, β) = Σ_{i=1}^{N} log(1 + exp(−y_i(β₀ + x_i^T β))) for y_i ∈ {−1, +1}.
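A minimal Python/numpy sketch of the two loss functions above; the example arrays passed to them are arbitrary and only meant to show the calling convention.

import numpy as np

def squared_loss(beta0, beta, X, y):
    # Squared loss for linear regression (Eqn 3.5)
    r = y - beta0 - X @ beta
    return 0.5 * np.sum(r ** 2)

def logistic_loss(beta0, beta, X, y):
    # Logistic loss for y in {0, 1} (Eqn 3.6): the negative binomial log-likelihood
    eta = beta0 + X @ beta
    return np.sum(np.log1p(np.exp(eta)) - y * eta)

# Example call on arbitrary data
X = np.array([[0.5, -1.0], [1.5, 0.3], [-0.2, 0.8]])
print(squared_loss(0.1, np.array([1.0, -0.5]), X, np.array([0.2, 1.4, -0.3])))
print(logistic_loss(0.1, np.array([1.0, -0.5]), X, np.array([0, 1, 1])))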


Figure 3.2.: Four loss functions for classification: 0/1 loss L(z_i) = I(z_i < 1), logistic loss L(z_i) = log(1 + exp(−z_i)), hinge loss L(z_i) = max{0, 1 − z_i}, and squared hinge loss L(z_i) = max{0, 1 − z_i}², where z_i = y_i(β₀ + x_i^T β) for linear models. Informally, for z ≥ 1 the predicted and observed classes match, sign(ŷ_i) = sign(y_i) (correct classification), and for z < 1 they do not match (mis-classification).
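The four losses in Figure 3.2 can be written directly as functions of the margin z; the short Python sketch below uses the definitions from the caption, and the grid of z values is arbitrary.

import numpy as np

def zero_one(z):       # 0/1 loss as defined in the Figure 3.2 caption: I(z < 1)
    return (z < 1).astype(float)

def logistic(z):       # log(1 + exp(-z))
    return np.log1p(np.exp(-z))

def hinge(z):          # max{0, 1 - z}
    return np.maximum(0.0, 1.0 - z)

def squared_hinge(z):  # max{0, 1 - z}^2
    return np.maximum(0.0, 1.0 - z) ** 2

z = np.linspace(-1.0, 3.0, 9)
for f in (zero_one, logistic, hinge, squared_hinge):
    print(f.__name__, np.round(f(z), 2))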


The squared hinge loss is similar to the standard hinge loss

L(β₀, β) = Σ_{i=1}^{N} max{0, 1 − y_i(β₀ + x_i^T β)},  y_i ∈ {−1, +1},  (3.8)

but the former (Eqn 3.7) is twice-differentiable whereas the latter (Eqn 3.8) is not. Twice-differentiability means that the Newton step-size used in coordinate descent (see Section 5.4.1) is the second derivative of the loss with respect to β_j, whereas to use coordinate descent with the hinge loss we must either use some pre-chosen step size, usually tuning it to achieve good convergence, or use a line search procedure. The disadvantage of the squared-hinge loss is that it is more sensitive to outliers in the data, with the loss increasing quadratically for samples as they move away from the separating hyperplane (the decision boundary between the two classes), whereas the hinge loss only increases the loss linearly.

These three loss functions for classification are shown in Figure 3.2. Also shown is the 0/1 loss L = Σ_{i=1}^{N} I(y_i(β₀ + x_i^T β) < 1), where a constant loss of 1 is incurred for incorrect classification. The 0/1 loss is non-convex and non-smooth and therefore difficult to optimise (due to the possibility of local minima and zero gradients), and the other loss functions can be considered to be convex relaxations of this loss (Bach et al., 2011), which are easier to optimise since they have a global minimum (or several equivalent minima if they are not strictly convex).

3.4. Feature Selection — Finding Predictive & Causal Markers

The term feature selection is used for the task of finding a subset of features that are strongly related to the output, while discarding irrelevant or weakly-predictive inputs. In this thesis, the term feature means either an original input variable, such as a gene or a genotype, or a derived variable after some transformation; we will use the terms feature and variable interchangeably. The implicit assumption is that only a relatively small subset of features are truly associated with the phenotype, whereas the remaining ones are spurious, resulting from random noise. The assumption of a relatively small set of relevant features is reasonable in gene expression and SNP analysis since we only expect a small proportion of the inputs to be truly causal. Even when this assumption does not strictly hold, it may still be useful to perform feature selection in order to reduce models to a manageable size, thus increasing biological interpretability, while maintaining good predictive performance. If the goal of the modelling process is strictly prediction, such as diagnostic or prognostic tests for breast cancer metastasis (van 't Veer et al., 2002), rather than investigating disease etiology, then feature selection can be used to find a small set of predictive markers, even though they may not necessarily be causally linked with disease but only statistically associated with it, as demonstrated in the toy example in Figure 3.3.
As long as these associations are robust (rather than spurious), they are useful for the task of prediction.


Figure 3.3.: A toy example of a three-gene network. If gene A is mutated and causes a downstream effect in genes B and C, then all three genes may appear to be associated with the phenotype, even though gene B is clearly non-causal.

It is convenient to group feature selection methods into three main approaches: filter methods, wrapper methods, and embedded methods (Guyon and Elisseeff, 2003).

3.4.1. Filter Methods

In the filter approach, a simple test statistic is used to pre-screen (filter) the inputs prior to fitting an overall model to the remaining data. Since gene expression is continuous whereas SNPs are discrete, different filters are used for each and we review them separately.

Filters for Gene Expression

In gene expression, some of the common filtering methods are the t-test, Pearson correlation, and the signal-to-noise ratio (SNR).

The two-sample t-test is a statistical test of differential expression for each gene between two groups, such as in case-control studies. Several variants exist, depending on whether the assumptions of equal sample sizes and equal variances for both classes are used. The simplest variant, which assumes equal sample sizes and equal variances, is

t = (x̄₁ − x̄₂) / √((s₁² + s₂²)/N),  (3.9)

where x̄_k and s_k² are the sample mean and variance for the kth group (k = 1 or k = 2), respectively, and N is the sample size. A p-value is derived by comparing the t statistic against a t-distribution with 2N − 2 degrees of freedom. More recent variants of the t-test include the moderated t-test (Smyth, 2004), an empirical Bayesian approach in which the variance estimate is pooled over the genes in order to reduce estimation variability.

Once the p-value has been computed for each gene, some cutoff is applied, such that genes with p-values below the cutoff are taken to be significantly differentially expressed.
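As an illustration, the following Python/numpy sketch computes the t statistic of Eqn 3.9 and the corresponding p-values for every gene at once; the simulated expression matrix, group sizes, and cutoff are illustrative assumptions only, and scipy is assumed to be available for the t-distribution tail probability.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative expression matrix: N samples per group, n_genes genes
N, n_genes = 20, 1000
group1 = rng.normal(size=(N, n_genes))
group2 = rng.normal(size=(N, n_genes))
group2[:, :50] += 1.0                        # the first 50 genes are differentially expressed

m1, m2 = group1.mean(axis=0), group2.mean(axis=0)
v1, v2 = group1.var(axis=0, ddof=1), group2.var(axis=0, ddof=1)
t = (m1 - m2) / np.sqrt((v1 + v2) / N)       # Eqn 3.9: equal sizes, equal variances
pvals = 2 * stats.t.sf(np.abs(t), df=2 * N - 2)

keep = np.where(pvals < 1e-3)[0]             # an arbitrary cutoff, as discussed below
print(len(keep))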


The cutoff can be arbitrary, simply to reduce the number of genes, or can be decided using corrections for multiple testing, such as the Bonferroni correction, which simply multiplies the nominal p-value by the number of tests, or by some variant of the false discovery rate (FDR) approach (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003).

In filtering with the Pearson correlation r, the correlation of each gene x with the phenotype y is computed as

r = [ (1/(N−1)) Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) ] / (s_x s_y),  (3.10)

where x̄ and ȳ are the means of the gene x and the phenotype y, respectively, and s_x and s_y are the sample standard deviations of x and y, respectively. Genes with small correlation (in absolute terms) are removed. Similarly, the SNR, as defined by Golub et al. (1999), is

SNR = (x̄₁ − x̄₂) / (s₁ + s₂),  (3.11)

where s_k is the standard deviation for the kth group. Genes with low SNR are removed. The cutoff can be determined by permutation tests, where the phenotype labels are randomly permuted multiple times but the gene expression data is held fixed. In each permutation, different cutoffs are assessed, each inducing a certain number of false positives, thus estimating the null distribution of cutoffs, that is, the distribution of cutoffs when there is no true association of the gene expression values with the phenotype. Finally, a cutoff that induced a sufficiently low false positive rate in the permuted data is applied to the original data.

Filters in Genetics

Due to the discrete nature of genetic data, different filters have been used for filtering SNPs. We first discuss filters for binary phenotypes, most commonly seen in case-control datasets.

The simplest approach is the allelic association test, which tests whether any single allele (rather than the genotype as a whole) is associated with the phenotype. This is the --assoc test used by the widely-used tool PLINK (Purcell et al., 2007) [4]. To understand the allelic test, consider the contingency table in Figure 3.4a, where the alleles are tabulated against the case-control status y. Note that the genotypes are not considered in this test; rather, each of the two alleles of a genotype is considered individually, thus doubling the sample size from N to 2N. Under the null hypothesis, both alleles appear in the same proportions between cases and controls. Deviations from the expected counts are quantified using the X² statistic summed over the 2 × 2 matrix of observed counts O

X² = Σ_{i=1}^{2} Σ_{j=1}^{2} (O_ij − E_ij)² / E_ij,  (3.12)

[4] http://pngu.mgh.harvard.edu/purcell/plink


where O_ij and E_ij are the observed and expected counts for the cell in the ith row and jth column, respectively. The expected counts are given by the product of the marginals divided by the total count n = n₁₁ + n₁₂ + n₂₁ + n₂₂ (here n = 2N alleles)

E_ij = (n_i1 + n_i2)(n_1j + n_2j) / n,  i, j = 1, 2.  (3.13)

The X² statistic is tested for significance by comparing it with the χ² distribution with one degree of freedom. Note that the allelic test depends on Hardy-Weinberg equilibrium (Section 2.4.2) (essentially, independence between the frequencies of the two alleles); otherwise it may incur an increased false positive rate.

(a) Counts
                  Cases y = 1      Controls y = 0
  Allele x = 1    n₁₁              n₁₂
  Allele x = 2    n₂₁              n₂₂

(b) Conditional probabilities Pr(y|x)
                  Cases y = 1          Controls y = 0
  Allele x = 1    n₁₁/(n₁₁ + n₁₂)      n₁₂/(n₁₁ + n₁₂)
  Allele x = 2    n₂₁/(n₂₁ + n₂₂)      n₂₂/(n₂₁ + n₂₂)

(c) Odds = Pr(y = 1|x)/Pr(y = 0|x)
  Allele x = 1    [n₁₁/(n₁₁ + n₁₂)] / [n₁₂/(n₁₁ + n₁₂)]
  Allele x = 2    [n₂₁/(n₂₁ + n₂₂)] / [n₂₂/(n₂₁ + n₂₂)]

Figure 3.4.: A contingency table of two alleles versus the case-control status, in terms of (a) counts, (b) conditional probabilities Pr(y|x), and (c) the odds.

For case-control studies, a common measure of association between a SNP and the phenotype is the odds ratio (OR), the ratio of the odds of the phenotype in one group to the odds of the phenotype in the other group, which is a common measure of the strength of the association. The odds of a binary event y ∈ {0, 1} are given by

Odds(y = 1) = Pr(y = 1)/Pr(y = 0) = Pr(y = 1)/(1 − Pr(y = 1)).  (3.14)

An odds of 1 means that both events have the same probability of occurring. An odds > 1 means that the event y = 1 has higher probability, and vice-versa. Figure 3.4b shows the same counts in the form of conditional probabilities Pr(y|x), and the odds are shown in Figure 3.4c.
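A short Python sketch of the allelic test for a single SNP, using Eqns 3.12 and 3.13, together with the odds ratio estimate that is given next in Eqn 3.15; the allele counts are made up for illustration and scipy is assumed for the χ² tail probability.

import numpy as np
from scipy import stats

# Allele-by-status counts as in Figure 3.4a (numbers are illustrative)
counts = np.array([[120.0, 80.0],    # allele x = 1: cases, controls
                   [ 60.0, 90.0]])   # allele x = 2: cases, controls

n = counts.sum()
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n   # Eqn 3.13
x2 = ((counts - expected) ** 2 / expected).sum()                  # Eqn 3.12
pval = stats.chi2.sf(x2, df=1)

odds_ratio = counts[0, 0] * counts[1, 1] / (counts[0, 1] * counts[1, 0])  # Eqn 3.15, below
print(round(x2, 2), round(pval, 4), round(odds_ratio, 2))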


The OR is estimated as

ÔR = { [n₁₁/(n₁₁ + n₁₂)] / [n₁₂/(n₁₁ + n₁₂)] } / { [n₂₁/(n₂₁ + n₂₂)] / [n₂₂/(n₂₁ + n₂₂)] } = (n₁₁ n₂₂)/(n₁₂ n₂₁).  (3.15)

An OR > 1 means that the odds of an event (such as having a disease) are higher in one group, for example in the cases, than in the other group, in this case the controls. An OR < 1 has the opposite interpretation. Note that for genetic data, the direction of the effect is arbitrary, as it depends on the coding of the alleles — which allele is used as the reference allele. Therefore, both SNPs with high ORs and SNPs with low ORs are potentially interesting. The p-value obtained from the χ² test is symmetric — an odds ratio of 2 has the same p-value as an odds ratio of 0.5.

Another test for association in case-control studies is the per-genotype test, in which the genotype (the two alleles) is considered as the unit of observation rather than each allele separately. Using genotypes rather than allele counts opens up the possibility of richer models, such as additive, dominant, recessive, and others. The model can be specified using the genotype coding. Assuming that the minor allele is denoted 'A', the three genotypes 'aa', 'Aa', and 'AA' can have the following coding schemes. Additive models (also called trend models) are coded as {0, 1, 2}, denoting the dosage of the minor allele 'A'. Dominant models assume that the effect of the minor allele is dominant, namely, the effect of one minor allele is equivalent to that of two minor alleles, and are therefore coded as {0, 1, 1}. Recessive models assume that there is no phenotypic effect unless there are two minor alleles, and are coded as {0, 0, 1}. Genotypic models do not assume any specific relationship between the number of alleles and the phenotype; rather, each genotype is treated separately (in statistics this is called a level in a factor), requiring a binary encoding such as {00, 01, 10}.

The tests for association within each model are slightly different: for the additive model, a Cochran-Armitage test for trend is performed (Clarke et al., 2011),

T² = [Σ_{i=1}^{3} w_i(n_{1i} n_{2·} − n_{2i} n_{1·})]² / { (n_{1·} n_{2·}/n) [Σ_{i=1}^{3} w_i² n_{·i}(n − n_{·i}) − 2 Σ_{i=1}^{2} Σ_{j=i+1}^{3} w_i w_j n_{·i} n_{·j}] },

where the subscript · represents summation along a row or column (n_{1·} is the total along row 1, n_{·1} is the total along column 1), and w = (w₁, w₂, w₃) is the genotype coding, (0, 1, 2) for the additive model. The T² statistic is compared to the 1-df χ² distribution to derive statistical significance. For the dominant and recessive models, a 2 × 2 contingency table can be constructed and tested using a 1-df χ²-test. For the genotypic test, a 3 × 2 contingency table is used, and tested for significance using a 2-df χ²-test.
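A sketch of the Cochran-Armitage trend test for a single SNP under additive coding, following the formula above; the genotype counts are illustrative only and scipy is assumed for the χ² p-value.

import numpy as np
from scipy import stats

# Genotype counts for one SNP: rows = cases, controls; columns = aa, Aa, AA (illustrative)
table = np.array([[200.0, 250.0, 50.0],
                  [300.0, 180.0, 20.0]])
w = np.array([0.0, 1.0, 2.0])          # additive (trend) coding

n = table.sum()
row = table.sum(axis=1)                # n_1., n_2.
col = table.sum(axis=0)                # n_.1, n_.2, n_.3

num = (w * (table[0] * row[1] - table[1] * row[0])).sum() ** 2
bracket = (w ** 2 * col * (n - col)).sum() - 2 * sum(
    w[i] * w[j] * col[i] * col[j] for i in range(2) for j in range(i + 1, 3))
T2 = num / (row[0] * row[1] / n * bracket)
pval = stats.chi2.sf(T2, df=1)
print(round(T2, 2), pval)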


The final class of tests are model-based tests, commonly including logistic and linear regression. Typically, such tests include only one SNP as a variable; however, other external covariables such as sex and age can be included (see Chapter 6 for discussion of multi-SNP models). The logistic model assumes that the log-odds of the phenotype y is a linear function of the jth genotype and other covariables, namely,

log[ Pr(y_i = 1 | x_ij) / Pr(y_i = 0 | x_ij) ] = β_{0j} + β_{1j} x_ij + α₁ z₁ + α₂ z₂ + ... + α_k z_k,  i = 1, ..., N,  (3.16)

where x_ij is the ith genotype for the jth SNP (coded using one of the additive, dominant, recessive, or other models), y_i is the ith case/control phenotype, β_{0j} and β_{1j} are the intercept and the regression coefficient for the jth genotype, respectively, and α₁, ..., α_k are optional coefficients for k external variables z such as sex and age. Similarly, for continuous phenotypes, a common model is the linear model

y_i = β_{0j} + β_{1j} x_ij + α₁ z₁ + α₂ z₂ + ... + α_k z_k + ε_ij,  ε_ij ∼ N(0, σ_j²),  i = 1, ..., N.  (3.17)

In both cases, a p-value for the association between each SNP and the phenotype is used to filter the SNPs, based on the t statistic in the linear regression case and on the approximate z-statistic (Wald test) or likelihood-ratio test in the logistic regression case.

3.4.2. Wrapper Methods

The second type of feature selection methods are wrapper methods, which are "black box" approaches, in that they attempt to explore the space of all possible feature subsets, fitting a model to each subset. The model is evaluated using some measure, such as predictive performance on a test set, which is used to guide the search towards better models. The search strategy must be chosen as well; for example, greedy forward search, in which the feature that adds the most to predictive performance is included in the model, one at a time, or greedy backwards search, where we start with all variables and then repeatedly drop the variable that degrades performance least (a sketch of greedy forward selection is given below).

The main advantage of wrapper methods over filter methods is that they fit an entire model to the data, not assuming independence of the inputs, whereas filter methods examine the association of each variable with the phenotype separately, thus implicitly assuming that the effect of one variable is independent of the effect of another, which is clearly not the case with gene expression and SNP data. With wrapper methods, correlations or other interactions between the inputs can be taken into account, which would have been missed when filtering each input separately. Another advantage is that wrapper approaches allow flexibility, in that they can be applied to any existing type of model, such as linear and logistic regression and support-vector machines.
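For concreteness, a minimal sketch of greedy forward selection wrapped around a logistic regression classifier, scoring each candidate subset by cross-validated accuracy; the simulated data and the use of scikit-learn here are illustrative assumptions, not a description of methods used later in this thesis.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))                             # illustrative inputs
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

selected, remaining, best = [], list(range(X.shape[1])), -np.inf
while remaining:
    # Score each candidate feature added to the current subset by cross-validated accuracy
    scores = [(cross_val_score(LogisticRegression(), X[:, selected + [j]], y, cv=5).mean(), j)
              for j in remaining]
    score, j = max(scores)
    if score <= best:                                      # stop once no feature helps
        break
    best = score
    selected.append(j)
    remaining.remove(j)
print(selected, round(best, 3))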


Wrapper methods have two main disadvantages. The first is that they are computationally expensive (exhaustive search is NP-hard), since the space of all feature combinations is exponentially large in the number of possible features (2^p combinations for p features). Heuristics are often used to circumvent this problem. Computational complexity is compounded by the need to fit different models many times over. This is especially a problem for large datasets, which are common in genomics. Therefore, wrapper methods have not been widely applied in this area. Second, predictive ability may not be a sensitive indicator of the importance of a variable to the model, especially when the model already includes several strong variables — the addition or removal of one variable might not change the predictive performance substantially, in which case it is not clear whether to retain the variable or to exclude it.

3.4.3. Embedded Methods

In contrast with wrapper methods, embedded methods perform feature selection as part of the model fitting process, using the weights estimated for each feature as the basis for inclusion in the final model, rather than the overall model predictive performance. Two of the best known embedded approaches are recursive feature elimination (RFE) (Guyon et al., 2002) and the lasso (Tibshirani, 1996).

Recursive Feature Elimination

RFE is conceptually similar to a wrapper method in that a model is fit to a subset of inputs and its predictive ability is then evaluated. However, the difference is that instead of performing a search in the space of all feature combinations and thus deriving an importance for each feature, the importance of each feature is derived from its (absolute) weight in the model, for example, the regression coefficient in a logistic regression. In addition, RFE starts from the full model and successively removes features. For example, in a linear regression with 100 possible features, a model is fit to all 100 features. Predictive performance (R² in this case) is used to assess the model. Then the feature with the smallest regression coefficient β (in absolute value) is removed, and the model is re-fit. This process is repeated until no features are left (to assess the contribution of each feature to the model) or alternatively until performance degrades (to find the best predictive subset). RFE can be applied to any model that maintains an internal representation of each feature's importance, such as support vector machines (with small modifications necessary for non-linear kernels), and linear and logistic regression.

As with wrapper methods, RFE can be computationally expensive since a separate model is fit to each set of features. In addition, there are many schemes for removing features, such as one at a time or several in one batch. It is not clear whether removing the features with the smallest weight in the model is always a sensible approach — for example, some features may have small weights but small variance as well, whereas others have higher weights but larger variance, in which case the features with the lower variance may be more useful to keep. Scaling the features to unit variance may help in this matter.
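A minimal sketch of RFE for linear regression as just described, removing one feature (the one with the smallest absolute coefficient) at a time; the data are simulated and the use of scikit-learn's LinearRegression is an assumption for illustration. scikit-learn also provides a ready-made implementation in sklearn.feature_selection.RFE.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=100)

features = list(range(X.shape[1]))
eliminated = []                                             # features in order of removal
while features:
    model = LinearRegression().fit(X[:, features], y)
    worst = features[int(np.argmin(np.abs(model.coef_)))]   # smallest absolute weight
    eliminated.append(worst)
    features.remove(worst)
print(eliminated[::-1])                                     # most important features were removed last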


The Lasso

The lasso (Hastie et al., 2009a; Tibshirani, 1996) is an approach to fitting models that are penalised by the sum of the absolute values of the weights, and it performs feature selection as part of the fitting.

To motivate the lasso, we first discuss the concept of penalised loss functions. As mentioned earlier, many statistical models can be expressed as the minimising solutions to a loss function. Generalised linear models (GLMs), which include linear and logistic regression, are fit using the principle of maximum likelihood, and for them the loss function is the negative log likelihood. In the case of support vector machines, the loss function is the hinge loss.

Taking linear regression as a concrete example, the loss function L over N samples and p variables is equivalent to the sum of squares of the residuals

L(x, y, β) = (1/2) Σ_{i=1}^{N} (y_i − x_i^T β)²,  (3.18)

where x_i is the p-vector of the ith sample of each input variable, y_i ∈ R is the output variable, and β ∈ R^p is the p-vector of regression coefficients (we omit the intercept here for notational convenience). The solutions that minimise the negative log likelihood are the same as the maximum likelihood estimates β*.

The lasso penalised loss is

L′ = L + λ||β||₁ = L + λ Σ_{j=1}^{p} |β_j|,  (3.19)

where λ ≥ 0 is a user-determined parameter that determines the degree of penalisation, and ||·||₁ is the ℓ₁-norm. As λ → 0, the solution β* approaches the unpenalised maximum likelihood solution. The lasso has the effect that for high values of λ, some of the variables are set exactly to zero, and the non-zero variables are shrunk towards zero.

Another common penalisation method is ℓ₂ penalisation, also called ridge regression, expressed as

L′ = L + λ||β||₂² = L + λ Σ_{j=1}^{p} β_j²,  (3.20)

where the penalty is the squared ℓ₂-norm of the coefficients. In contrast with ℓ₁ penalisation, ℓ₂ penalisation does not generally induce sparse models; that is, all coefficients in an ℓ₂-penalised model are typically non-zero (except in unrealistic scenarios with zero noise). Hence, ℓ₂ penalisation is not sufficient for feature selection, and must be augmented with another method such as RFE.

We defer discussion of how lasso models are fit to data to Chapter 6. For now, we discuss the feature selection properties of the lasso.
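The contrast between the ℓ₁ and squared ℓ₂ penalties is easy to see numerically. In the following sketch, scikit-learn's Lasso and Ridge are used as stand-ins for Eqns 3.19 and 3.20 (their penalty parameter alpha is scaled slightly differently from λ), and the sparse simulated data are illustrative only.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 50))
beta = np.zeros(50)
beta[:3] = [2.0, -1.5, 1.0]                       # only three truly non-zero coefficients
y = X @ beta + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                # alpha plays the role of lambda in Eqn 3.19
ridge = Ridge(alpha=1.0).fit(X, y)                # squared l2 penalty as in Eqn 3.20
print("lasso non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
print("ridge non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))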


Figure 3.5.: Penalised squared loss in 2 dimensions. The red contours show the curves of constant loss for different solution pairs (β₁, β₂), and β* is the unpenalised solution. Also shown are the feasible regions (in cyan) imposed by the ridge constraint β₁² + β₂² ≤ t (left), and by the lasso constraint |β₁| + |β₂| ≤ t (right). Adapted from Hastie et al. (2009a).

For an intuitive explanation of why the lasso sets some variables to zero exactly, we recast the penalty formulation (also called the Lagrangian formulation) as the equivalent constrained optimisation formulation

β̂* = arg min_β L,  subject to Σ_{j=1}^{p} |β_j| ≤ t,  (3.21)

where t ≥ 0 is the constraint on the ℓ₁-norm of the coefficients. As shown for 2 dimensions in Figure 3.5, the feasible region for the lasso solution is given by |β₁| + |β₂| ≤ t, which is a convex set (but not strictly convex). The solution to the penalised problem is at the first intersection of the non-penalised loss with the feasible region. If the solution is at a corner of the feasible region, then one of the coefficients is exactly zero. These corners occur more often with non-differentiable penalties such as the ℓ₁-norm than with differentiable penalties such as the ℓ₂-norm (Hastie et al., 2009a).

More rigorous analyses of the lasso's asymptotic behaviour (Knight and Fu, 2000; Zhao and Yu, 2006) have shown that under the condition of irrepresentability, loosely interpreted as the condition that the covariance between the relevant and irrelevant variables is not too large, and provided that the true number of relevant variables is small enough, the lasso is consistent, in the sense that as the number of samples N → ∞, the lasso recovers the true non-zero variables (the support) with probability one (sometimes called sparsistency). In practice, for a given dataset we do not know whether this condition holds, since we do not know the relevant and irrelevant variables in the first place, hence we cannot be certain that the lasso only identifies relevant variables. This problem can be mitigated by schemes such as stability selection (Meinshausen and Bühlmann, 2006), which is essentially a multiple resampling procedure for determining the relevance of each variable.
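A simplified sketch of the resampling idea behind stability selection: refit the lasso on many random subsamples and record how often each variable is selected. The subsample size, penalty, and selection-frequency threshold used here are illustrative assumptions rather than the exact procedure of Meinshausen and Bühlmann.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))
beta = np.zeros(50)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=100)

B, freq = 100, np.zeros(50)
for _ in range(B):
    idx = rng.choice(100, size=50, replace=False)       # a random half of the samples
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    freq += (fit.coef_ != 0)
freq /= B
stable = np.where(freq >= 0.8)[0]                       # selected in at least 80% of resamples
print(stable)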


The main advantages of lasso-based feature selection are that the lasso problem is a convex optimisation problem (as long as the loss function is convex as well), allowing us to use the extensive toolkit of convex optimisation to solve it efficiently. Taking advantage of the model's sparsity allows the development of fast algorithms, such as coordinate descent, discussed in Chapter 6. As the sparsity is induced by the model fitting process itself, there is no need to use external approaches such as RFE or wrappers.

Beyond the Lasso

The ℓ₁ norm is one instance of the ℓ_p norms, defined as

||β||_p = ( Σ_{j=1}^{p} |β_j|^p )^{1/p},  p ≥ 1.  (3.22)

The ℓ₁ norm is the only convex norm that induces sparse models, since it is non-differentiable at 0. However, the lasso has the sometimes undesirable side effect that non-zero variables are shrunk towards zero too much (high bias) (Zhao and Yu, 2006).

There also exist penalisation methods based on quasi-norms, where 0 < p < 1 is used [5]. In contrast with the ℓ₁-norm, norms with 0 < p < 1 induce stronger model sparsity, while shrinking the non-zero variables less. However, such norms are not convex (the problem does not have a global solution) and cannot be solved using standard convex optimisation tools.

Another improper norm is the ℓ₀-norm, which is equivalent to the number of non-zero elements in a vector

||β||₀ = Σ_{j=1}^{p} I(β_j ≠ 0),  (3.23)

where I(·) is the indicator function, 1 for true and 0 for false. The ℓ₀-norm is useful in that it allows constraining the model to a predefined number of non-zero variables. However, the ℓ₀-norm is non-convex and non-differentiable, and exactly solving an ℓ₀-norm constrained problem in p variables is equivalent to a combinatorial optimisation problem with 2^p combinations, making it NP-hard; several approximations to the ℓ₀ norm exist (Weston et al., 2003). Solving the ℓ₁-norm constrained problem can be seen as a convex relaxation: a tractable approximation of the ℓ₀-norm problem (Bach et al., 2011).

The final norm we briefly mention is the ℓ∞-norm, which is convex, and is defined as

||β||∞ = max_{j=1,...,p} |β_j|.  (3.24)

The ℓ∞-norm is useful for penalising groups of variables together.

Apart from simple norms, hybrid norms can also be used for penalisation. One such method is the elastic net (Zou and Hastie, 2005), which combines both the ℓ₁ and the ℓ₂ penalties.

[5] When p < 1, ||·||_p is not strictly a norm but a quasi-norm, since the triangle inequality ||x + y||_p ≤ ||x||_p + ||y||_p is not satisfied.


they allow a hierarchical structure (Gelman and Hill, 2007), where several layers of the data can each get their own distribution. For example, in a case-control study, we can set up two prior distributions, one for cases and one for controls, which in turn determine the parameters of the distributions of each gene.

The Bayesian approach is not without its limitations. First, when analysing data with little domain knowledge, it may be difficult to set an informative prior, and a vague (non-informative) prior is used, often chosen using cross-validation, which makes the process very similar to frequentist penalisation methods. Second, most priors used are chosen from a small set of convenient analytical forms (for example, the normal distribution for continuous variables, the Dirichlet prior for binomial variables) for which closed-form posterior distributions can be derived and solved efficiently. When analytical solutions are not available, stochastic methods such as Markov-chain Monte-Carlo (MCMC) methods are usually employed, which can be time-consuming for large datasets and may require expert tuning to determine convergence. It is unclear how well MCMC can be applied to inference on the genome-wide scale with current commodity computing hardware.

3.4.4. Other Methods for Dimensionality Reduction

An alternative approach to feature selection is to perform dimensionality reduction on the data, such that the data is "compressed" from its original high dimensions to lower dimensions, making it more amenable to analysis using standard classification and regression tools. One common approach is Principal Component Analysis (PCA) (see, for example, Hastie et al. (2009a)), which is based on the singular value decomposition of the data X ∈ R^{N×p}

X = U D V^T,  (3.27)

where U ∈ R^{N×k} and V ∈ R^{p×k} are orthogonal matrices whose columns are the eigenvectors of XX^T and X^T X, respectively, D ∈ R^{k×k} is a diagonal matrix consisting of the square roots of the eigenvalues of X^T X and XX^T (the same eigenvalues in both cases), and k = min{N, p}. To obtain a dimensionality reduction, we project the data onto the eigenvectors

X′ = X V.  (3.28)

We can select as many columns of X′ as required: the first k columns of X′ (the principal components) are the best rank-k approximation of the original data X, and this submatrix can be used as input for any classification or regression method in place of the original data. Each principal component explains some proportion of the variance in X, such that progressively using more principal components explains more variation in X.
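A short numpy sketch of PCA via the SVD of Eqn 3.27 and the projection of Eqn 3.28; column-centring of X is assumed as a preprocessing step, and the matrix dimensions and number of retained components are illustrative only.

import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 500))                     # illustrative N x p data matrix

Xc = X - X.mean(axis=0)                             # centre the columns (assumed preprocessing)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # X = U D V^T as in Eqn 3.27
X_proj = Xc @ Vt.T[:, :10]                          # first 10 principal components (Eqn 3.28)
explained = d ** 2 / np.sum(d ** 2)                 # proportion of variance per component
print(X_proj.shape, np.round(explained[:5], 3))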


While PCA is convenient for reducing the dimensionality of large datasets, it has several disadvantages. First, the number of principal components (PCs) to use must be determined somehow. A common way to decide is to examine the proportion of variance explained by each subsequent principal component, and cut the number of PCs when a substantial amount of variance has been explained, or when the increase in explained variance has plateaued (the "knee of the curve"). Second, PCA is an unsupervised method, and while it is useful for explaining variation in X, this variation may not necessarily be useful for predicting some output y. Semi-supervised PCA has been proposed to take into account only the useful variation (Bair and Tibshirani, 2004). Third, interpretation of the PCs is less intuitive than that of the original inputs, since the PCs are linear combinations of the original variables.

3.4.5. Other Methods for Classification and Regression

Two other common approaches to predictive modelling include random forests and boosting. We survey them briefly.

Random Forests

There are two essential components to the Random Forests (RF) method (Breiman, 2001): trees and bootstrap aggregation (bagging; Breiman, 1996). The method relies on generating a large number of trees, each on a slightly different version of the data (achieved by resampling the data with replacement), and then averaging over these trees (bagging), in order to reduce the variance and achieve a single predictor with low bias and lower variance (Hastie et al., 2009a). A basic algorithm for inducing an RF classifier is given in Hastie et al. (2009a), assuming a user-chosen number of trees B (a usage sketch follows below):

1. For b = 1 to B:
   • Draw a bootstrap sample (with replacement) from the data, of the same size as the original data.
   • Induce a tree T_b using the sampled data, by repeating the following process for each node in the tree until a minimum node size is achieved, or, in classification, until each node is "pure" (contains only one class):
     – Select m_try ≤ p variables at random out of the p variables.
     – Out of these m_try variables, pick the best variable in terms of splitting the data, based on some measure such as the classification error rate.
     – Split the node into two child nodes based on the selected variable.
2. Return the set of trees {T_1, ..., T_B}.

Using a subset of variables m_try < p leads to lower correlation between the trees, and consequently to less overall variance in the bagging step, and hence to better bagged models than using all p variables at each split.
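The usage sketch referred to above, using scikit-learn's RandomForestClassifier as a stand-in for the algorithm just described; the simulated data, number of trees, and m_try setting are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(size=300) > 1).astype(int)

rf = RandomForestClassifier(n_estimators=500,     # B trees
                            max_features="sqrt",  # m_try variables considered at each split
                            oob_score=True,       # out-of-bag performance estimate (see below)
                            random_state=0).fit(X, y)
print("OOB accuracy:", round(rf.oob_score_, 3))
pred = rf.predict(X[:5])                          # majority vote over the B trees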


An estimate of the expected generalisation predictive performance of the RF model is given by the out-of-bag (OOB) estimate: at each step, we can estimate the error for the model trained on the bootstrap sample by using it to predict the data that was not selected, a process similar to cross-validation.

For classification of a new sample, the data is fed to each tree T_b, and the final prediction is a simple majority vote over the B predictions for each sample.

Boosting and Gradient Boosting Machines

As in Random Forests, the boosting approach (Freund and Schapire, 1997) relies on multiple classifiers (called learners) which are combined to produce an ensemble predictor. However, there are substantial differences between the two approaches. In contrast with RF, where the base classifiers are large trees (leading to low bias but high variance), in boosting the base classifiers are weak, capable of predicting the outcome with probability only slightly higher than chance. Roughly speaking, the basic boosting approach, exemplified in the original algorithm AdaBoost (Freund and Schapire, 1997), iteratively adds these weak classifiers to the model, while upweighting the importance of samples that were incorrectly classified in the previous round. When the algorithm terminates (for example, after a fixed number of iterations), the final model is a weighted combination of the individual classifiers, each weighted by the error it induced on the data in its respective iteration.

Later, Friedman et al. (2000) described the deep links between boosting and optimisation of an exponential loss function, opening the way to expanding the boosting approach to other loss functions such as the logistic loss (logitboost), linear loss, and others. A regression tree is used as the learner in each iteration, and the next tree is regressed on the residuals from the previous iteration. A modern approach to boosting is the Gradient Boosting Machine (GBM) (Friedman, 2001; Ridgeway, 1999), which is a fast and flexible method, allowing for the use of multiple loss functions and exploration of interaction terms (ANOVA decomposition), which may be useful for considering epistasis between SNPs. GBM-like approaches have been proposed for analysis of gene expression data (Luan and Li, 2008; Wei and Li, 2007) and more recently for SNP data (Cosgun et al., 2011; Ogutu et al., 2011).

3.4.6. Discussion

We have presented some of the main approaches for supervised learning and feature selection, with an emphasis on penalised methods, particularly the lasso. The lasso approach assumes that the inputs are truly sparse, in contrast to methods such as ridge regression where each input variable gets some non-zero weight. Hastie et al. (2009a) discuss what they call "betting on sparsity": in high dimensional data (N ≪ p), which is commonly the case in gene expression and genetic data, if the data are truly sparse in the input variables then lasso-type approaches will tend to do better than non-sparse methods, whereas if the data are truly non-sparse then both types of methods will tend to do badly.
Therefore, they advocate using sparse methods for high dimensional data. An important practical advantage of sparsity-based methods like


the lasso is that assuming sparsity simplifies computations greatly. Lasso penalised models can be fit efficiently using coordinate descent, enabling analysis of large SNP datasets consisting of thousands of samples and hundreds of thousands of SNPs, as discussed in Chapter 6.

The standard lasso method has some drawbacks. First, the lasso tends to arbitrarily select one variable out of a set of highly correlated variables, in contrast with the ℓ₂ penalty, which will assign all such variables a non-zero weight, albeit with potentially different signs. Thus the interpretation of the weights differs between the two methods — with the lasso, a zero weight may be given to a variable that is associated with the outcome but is also highly correlated with a variable already in the model. This does not imply that the first variable is not important, only that it is redundant in explaining the output (it does not further minimise the loss subject to the lasso constraints). This is important when comparing marker lists generated from different datasets, since there might seem to be little overlap between the selected markers, when in fact they are conveying the same information. Second, the lasso can also break down, in the sense of including too many non-relevant variables in the model, when the truly associated variables are highly correlated with the spuriously-associated variables. Finally, the lasso penalty is not directly interpretable, and is usually set through cross-validation. Bayesian approaches allow for more principled ways of determining priors, such as incorporating prior biological knowledge (Kim and Xing, 2011). However, the computational cost of Bayesian methods is much higher in comparison — Kim and Xing (2011) report fitting models to 100,000 SNPs in one day, whereas lasso models can be fit to such data in minutes or seconds on commodity hardware. Ultimately, all statistical models are simplifications of complex biological reality. The assumptions behind our models can and should be checked; however, models can still be useful and biologically informative even if not completely correct.

3.5. Feature Selection and Multivariable Models of Genetic Data

As discussed in Chapter 2, genetic research has largely evolved independently of research in gene expression, with different analytical tools used in each. This is both for historical reasons, as the field of genetics predates modern gene expression experiments, and because of the unique characteristics of the data, such as the discreteness of the genotype, the fact that genotypes are inherited and thus show similarities between siblings, and practical considerations owing to the much larger number of SNPs that are typically assayed, usually an order of magnitude more than the genes assayed in gene expression experiments. Therefore, we now discuss the ways in which feature selection and statistical models have been applied in genetics, focusing on detection of SNPs associated with human disease in GWAS.
We do not discuss pedigree-based methods.

Whether used for investigation of binary or quantitative phenotypes, statistical analysis of SNP data is conventionally done on a univariable (per-SNP) basis, where each SNP is


individually tested for association with a phenotype. This approach is statistically well-studied, and many tests fall within this category, such as the allelic and genotypic χ² tests, the Cochran-Armitage trend test, and univariable logistic regression (see Section 3.4.1). Univariable statistics have been widely applied for detecting many variants associated with a wide range of human diseases and other phenotypes.

The univariable approach has several shortcomings. First, a multiple testing correction must be applied due to the large number of hypothesis tests performed, in order to control the type I error rate. If we use the stringent Bonferroni correction, the multitude of tests translates into very strict p-value cutoffs, which may exclude some predictive SNPs. Second, most univariable analyses do not account for LD between SNPs, leading to selection of highly correlated SNPs. While all such correlated SNPs can potentially be biologically informative, many of them may be redundant for prediction of the phenotype, since the marginal information provided by each of them is small. In this case, it may be better to select SNPs with weaker marginal association (larger p-value), but that are less correlated with the other SNPs already selected. Third, merely detecting a set of SNPs is not in itself sufficient for the purposes of predictive modelling: all detected SNPs must be merged into one model, for example, by fitting a logistic regression to the SNPs with p-values below the cutoff. Any informative SNPs that did not pass the cutoff will not be able to contribute to this predictive model, and predictive ability may consequently be reduced.

An alternative modelling approach to univariable testing is multivariable modelling of SNPs — predictive models that take into account all available data concurrently. Specifically, lasso penalised models are an attractive class of multivariable models that address the issues identified above. As discussed in Section 3.4.3, the lasso model fit is penalised with a tunable penalty parameter. Instead of selecting SNPs by p-value, they are included in or excluded from the model based on how much they contribute to the model fit, balanced by the magnitude of their effect (with the balance determined by the penalty). Thus, the lasso approach potentially considers all SNPs in the model, with some of them receiving zero weights (becoming excluded) depending on the penalty; the penalty is tuned by cross-validation. Therefore, lasso models need not exclude SNPs that do not achieve genome-wide significance, and all SNPs are candidates for inclusion in the model. Further, being a multivariable model — specifically, a linear model — these lasso models account for the correlation between SNPs, in that out of a group of highly correlated SNPs only one may get selected.
In other words, the selected SNPs are non-redundant in terms of contribution to the predictive ability of the model. Finally, the same model can be used for prediction of the phenotype from genotype.

The usefulness of lasso multivariable models for modelling SNP effects (Ayers and Cordell, 2010) has inspired several methods, including lasso penalised logistic regression (Wu et al., 2009), the adaptive lasso (Yang et al., 2010), "Bayesian-inspired" logistic regression (HyperLasso) with two sparsity-inducing priors (the double exponential, which is identical to the lasso, and


the normal-exponential gamma prior, NEG) (Eleftherohorinou et al., 2009; Hoggart et al., 2008), and hierarchical Bayesian linear regression with priors based on existing knowledge of LD structure (Kim and Xing, 2011). Another (non-sparse) Bayesian approach was presented by Logsdon et al. (2010). These studies and others have largely focused on assessing how well causal SNPs can be detected in simulated data, or how well already-characterised SNPs can be found in real data. These studies have shown that multivariable models, especially sparse ones such as the lasso and related priors such as the NEG, are better able to detect causal variants than univariable (SNP-at-a-time) statistics. Relatively few studies have analysed the predictive ability of such models across a wide spectrum of human complex disease. Several notable exceptions include Kooperberg et al. (2010), who applied lasso logistic regression to several datasets (Crohn's disease, type-1 and type-2 diabetes); Wei et al. (2009), who examined logistic regression and support vector machine models of the same data, but achieved better results in terms of predictive ability, especially for type 1 diabetes; and Eleftherohorinou et al. (2009), who used HyperLasso models, based on pre-identified SNPs belonging to known pathways, to model disease risk in Crohn's disease, rheumatoid arthritis, and type 1 diabetes. These studies have shown the utility of multivariable models in risk prediction, but they remain the minority approach to analysing SNP data, and the per-SNP univariable approach still dominates the literature. We employ sparse models throughout this thesis: in Chapters 5 and 6 we explore the use of lasso multivariable linear models for analysis of case-control SNP datasets, in Chapter 7 for analysis of gene expression, metabolites, and SNPs, and in Chapter 8 we use a sparse method that is an extension of the lasso, applied to the setting of multiple correlated phenotypes.


4. Prediction of Breast Cancer Prognosis using Gene Set Statistics

Gene expression microarrays measure mRNA concentration in cells, as a proxy for measuring gene activity. The advent of relatively cheap gene expression microarray technology, especially commercial oligonucleotide arrays, has made it possible to assay tissue with various phenotypes under a multitude of conditions. Due to high interest in the molecular basis of human diseases, there have been many expression experiments that explore human cell lines and human tissue originating from samples with different conditions, especially those related to disease such as cancer (for example, Golub et al., 1999; Ramaswamy et al., 2001; van 't Veer et al., 2002, and many others; Sotiriou and Pusztai, 2009). The ultimate aim of such studies is two-fold: to better understand the mechanisms underlying disease (etiology), and to define better markers of disease for early detection, diagnosis, and prognosis.

In one of the first such studies, van 't Veer et al. (2002) considered gene expression data from breast tissue coming from breast cancer patients, with the goal of predicting whether distant metastasis (metastasis into other organs outside the breast) would occur within five years. Subsequently, many studies have produced predictive gene lists for different diseases. However, gene lists produced from similar datasets, or even lists produced from slightly different versions of the same data, often showed little overlap, raising doubts about the validity of these lists. Our approach to this issue is to use pre-existing knowledge, in the form of groups of genes (gene sets), to form aggregate features that are then used for classification. By aggregating over gene sets, the resulting features are less affected by noise or experimental variability.


In this chapter we show that the use of gene sets produces feature lists that are more stable and reproducible, while maintaining the same predictive ability as the individual genes, and that these gene sets correspond to biologically plausible mechanisms of cancer metastasis, such as the cell cycle.

4.1. Introduction

Breast cancer is one of the most common cancers in the Western world, and the most common cancer among Western women (Weigelt et al., 2005). Much attention has been devoted to understanding the biological mechanisms of breast cancer, towards several goals. The first goal is early diagnosis, before major symptoms appear. A second goal is prognostication — prediction of prognosis based on current data, mainly the recurrence of distant metastasis, which is the main cause of death. A third goal is increased biological insight, to allow development of more effective treatments.

Here we concentrate on the issue of prognostication, more precisely, predicting distant metastasis in breast cancer patients for up to 5 years into the future. The cutoff point of 5 years is indeed an arbitrary one, and binarising a continuous variable leads to some loss of information — for example, of two patients with relapses at 4.9 and 5.1 years respectively, the first is considered to have a poor outcome but the second is considered to have a positive outcome. Continuous time-to-event data can be and has been analysed using survival models. However, we use the binarised form of the outcome since we wish to compare our results with previous studies that have done the same, and converting the problem into a binary classification task is convenient since there are many tools for binary classification.

Those patients that are predicted to relapse within 5 years are considered to be a high-risk subgroup. The goal is to identify these patients based on gene expression data, so that they can be treated more aggressively with adjuvant chemotherapy. One of the basic analyses of gene expression is searching for differentially expressed (DE) genes between two conditions. One of the simplest such analyses is the heatmap, which utilises hierarchical clustering of the samples and the genes to produce a visual display of patterns of differential gene expression across two phenotypes. Figure 4.1 shows a heatmap of the top 50 differentially expressed genes in a subset of 250 samples from five breast-cancer datasets, where hierarchical clustering was used to cluster both the samples and the genes. Heatmaps are useful for visualising gross features of the data, such as whether there are any substantial differences in gene expression between groups of samples.
In this example, roughly three major groups of genes with similar expression profiles are visible (group 1 with probesets 204475_at through 206023_at, group 2 with 215176_x_at through 217157_x_at, and group 3 with 220177_s_at through 216474_x_at). Considering the phenotypic status shown for each sample, we can see some differential expression between the two classes: the left hand side of the plot is mostly controls (no-relapse), with low gene expression for group 1 and high gene expression for groups 2 and 3.


In contrast, on the right hand side there are far more case (relapse) samples; the genes in group 1 show low expression overall, and group 3, and to some extent group 2, show higher expression. While these patterns are highly suggestive of associations between the gene expression levels and the phenotypes, this information is not in itself enough to quantify the associations so that we may predict the phenotype of a new sample, and hence it cannot be used for prognosis prediction.

Two of the first studies to propose a prognostic gene list for predicting breast cancer distant metastasis were by van 't Veer et al. (2002) and, subsequently, van de Vijver et al. (2002), where a classifier was trained to predict metastatic class from an annotated dataset (supervised classification). The van 't Veer classifier is based on correlation between gene expression and the class label. First, they selected about 5,000 significantly differentially expressed genes of the 25,000 genes on the array, over 78 samples. Out of these, they then selected 271 genes that had absolute correlation of 0.3 or higher with the disease status. Of the 271 genes, they selected genes a second time, by starting with an empty list and adding 5 genes at a time. The optimal size of the list was evaluated using leave-one-out cross-validation. Finally, they arrived at a prognostic list of 70 genes (termed NKI70). They reported a classification accuracy of 83% on the training set. This prognostic gene list was then validated on another, independent dataset consisting of 19 samples, with similar results. Other prognostic lists were later compared by Fan et al. (2006) and Haibe-Kains et al. (2008), and showed high concordance in terms of classifying patients into the same risk categories. Later, the 70-gene list was renamed MammaPrint (Wittner et al., 2008), and is currently being commercialised for diagnostic purposes.

Concurrently, Michiels et al. (2005) evaluated seven studies of breast cancer, reporting that in all but one, predictive ability was not significantly better than random, and that different studies produced different lists of prognostic genes. They used random splitting of the data into training and testing sets in order to estimate the predictive ability of these models. Similarly, Ein-Dor et al. (2005) again analysed the van 't Veer data, and again showed that different lists produced from random perturbations of the same data were highly disjoint. Later, they estimated that thousands of samples, rather than the hundreds routinely available in microarray studies, would be required in order to achieve stability of the gene lists (Ein-Dor et al., 2006).

These results raise several questions. First, are the genes identified by such studies truly associated with cancer and metastasis, or are they spurious, the results of complex models overfitting the noisy data? Second, if the genes are associated with cancer, are they also causally related to it?
A gene may be downstream of a cancer-causing gene and therefore be associated with cancer but not cause cancer. Third, can a stable gene list be found at all? Fourth, do the different lists actually represent the same underlying pathways, and hence might they be more in agreement when interpreted in the wider biological context than is otherwise apparent?


[Figure 4.1: heatmap of the top 50 differentially expressed probesets (rows) over 250 samples (columns); colour key: row Z-score from −4 to 4. Probeset and sample labels are not reproduced here.]
Figure 4.1.: A heatmap showing differentially expressed genes (rows) over a subset of 250 samples (columns) from the five breast cancer datasets. Differential expression was determined using linear models in limma (Smyth, 2005). Samples are coloured red and blue for < 5 years and ≥ 5 years to metastasis, respectively. Under- and over-expressed genes are coloured red and green, respectively.


If the different predictive genes truly represent the same underlying biology, then perhaps what is needed is to evaluate genes as members of gene pathways, where we loosely define a pathway as a set of interacting genes, and to use the pathway information to guide the selection of predictive genes. Ideally, one would like to have detailed gene pathway information, which could then be used to select genes with a potential causal link to cancer and metastasis. This has largely not been possible due to limitations on data size (too few microarray samples available) and the complexity of gene-gene interactions. Therefore, the problem of finding the pathway information must be tackled in other ways. One way is to assume that genes with correlated expression belong together in one pathway (or are somehow otherwise related to each other even if they do not interact directly), and to find the sets de novo in the data, using methods such as searching over a space of models representing regulation programs (Segal et al., 2003) or k-means clustering (Yousef et al., 2007). Similarly, van Vliet et al. (2007) used an unsupervised module discovery method to find gene modules, calculated a discrete module activity score, and used the score as a feature for a naive Bayes classifier. They reported that classifiers based on gene sets were slightly better predictors of breast cancer outcome than those based on individual genes. Chuang et al. (2007) used a mutual information scoring approach to analyse known protein-protein interaction (PPI) networks, infer gene pathways, and find subnetworks predictive of breast cancer metastasis.

The other main approach to leveraging pathway structure has been to use external pathway information, for example from interactions defined in the literature. Svensson et al. (2006) analysed expression data from ovarian cancers based on gene sets from the Gene Ontology (GO) (Ashburner et al., 2000); to represent each set's expression they used a statistic that is essentially a majority-vote of the over- and under-expressed genes (whether the set is over- or under-expressed on average). In a large study of 12 breast cancer datasets, Kim and Kim (2008) reported a classification accuracy of 0.676 over 6 additional datasets, using 2411 gene sets from GO categories, pathway data, and other sources. The balanced accuracy was 0.64¹, averaged over all dataset pairs (one dataset in the pair used for training and the other for testing). Additionally, Kim and Kim (2008) reported low overlap between the top gene sets identified, in terms of their common genes. Lee et al. (2008) used the Molecular Signatures Database (MSigDB) C2 gene sets (Subramanian et al., 2005), which are lists of manually curated genes found to be associated with cancer in the literature and in large-scale data mining experiments; they selected gene sets using the t-test on their constituent genes, and used the sets as features for classification in several cancer datasets, including breast cancer.
They1 The balanced accuracy is BACC = (sens + spec)/2 where sens and spec are the sensitivity and specificity,respectively, and accounts <strong>for</strong> uneven proportions <strong>of</strong> the two classes in the data, unlike the standardaccuracy.61


They did not, however, examine whether features derived from gene sets are any more stable than those based on individual genes, a question which is the main focus of our work.

Once a tentative or known gene pathway has been identified, the next issue is how to use the expression levels of its constituent genes in a meaningful way. Some options are to use the mean or median expression (Guo et al., 2005), the first few principal components (Bild et al., 2006), and the z-statistic (Törönen et al., 2009). Below we examine several approaches, which we call set statistics.

In this work we propose using prior knowledge, in the form of pre-specified lists of genes (gene sets) based on the Molecular Signatures Database (MSigDB) (Subramanian et al., 2005), in order to form new features from individual genes. A gene set is a group of genes that have been selected due to shared functionality, membership in the same biological pathway, or empirical relatedness (coexpression). Moving away from considering genes in isolation, these features serve as proxies for measuring the activity of the set as a whole. There are many approaches to gene set enrichment (Ackermann and Strimmer, 2009; Subramanian et al., 2005); however, it is not clear whether these enrichment measures imply good predictive ability as well. Using five breast cancer datasets, we compare features derived from gene sets with features based on individual genes, with respect to the following criteria:
• Discrimination: ability to predict metastasis within 5 years, both on average and in terms of its variance;
• Stability of the ranks of individual features within datasets;
• Concordance between the weights and ranks of features from different datasets; and
• Underlying biological processes indicated by the gene sets.

4.2. Methods

We now describe the breast cancer datasets used in this work, the gene set statistics approach, and the framework for determining which gene sets are associated with the phenotypic outcome.

4.2.1. Data

We used five previously published breast cancer datasets from NCBI GEO (Edgar et al., 2002): GSE2034 (Wang et al., 2005), GSE4922 (we used the untreated subset of the Singapore cohort) (Ivshina et al., 2006), GSE6532 (Loi et al., 2007, 2008) (untreated cohort), GSE7390 (Desmedt et al., 2007), and GSE11121 (Schmidt et al., 2008) (Mainz cohort). All five were assayed on the Affymetrix HG-U133A microarray platform (some of the datasets included other microarray platforms, which were removed).


We removed quality control probesets and probesets with close to zero variance across the samples; in total, each microarray had 22,215 remaining probesets. We normalised GSE2034 and GSE6532 using quantile normalisation as implemented in RMA (Bolstad et al., 2003). For the remaining datasets, raw data were not available and we used the data as normalised by their respective authors in the original publications. All data were converted to the log2 scale, as gene expression data are typically better approximated by the normal distribution on this scale (Wang and Speed, 2003). Missingness was very low, with only GSE6532 having 12 missing values; therefore we used simple median imputation for each gene instead of more sophisticated imputation approaches (Bø et al., 2004; Kim et al., 2005).

The data contain a majority of lymph-node-negative and some node-positive breast cancer samples. For GSE7390, GSE11121, and GSE2034, none of the patients received adjuvant treatment. For GSE6532 and GSE4922, some patients received adjuvant treatment; these were removed from the data. The data contain patients with both ER-positive and ER-negative tumours. Patients were classified into two groups, low and high risk, according to the time to distant metastasis, using a cutoff point of 5 years. Patients censored before the cutoff were considered non-informative and were removed. The final number of samples for each dataset is shown in Table 4.1.

4.2.2. Discrimination

We measure the discrimination of a classifier using the Area Under the ROC Curve (AUC or AROC) (Hanley and McNeil, 1982), defined as

\widehat{\mathrm{AUC}} = \frac{1}{N_+ N_-} \sum_{i=1}^{N_+} \sum_{j=1}^{N_-} \left[ I(\hat{y}_i > \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right], \qquad (4.1)

where N_+ + N_- = N are the numbers of positive and negative labels, respectively; \hat{y}_i is the prediction for the ith sample, and I(\cdot) is the indicator function, I(x) = 1 when x is true and 0 otherwise. The sample AUC has a probabilistic interpretation as the (estimated) probability of correctly ranking two randomly chosen samples in the correct order (that is, short-term survival before long-term survival), plus a correction for ties. AUC = 0.5 is equivalent to random ranking, whereas AUC = 1 and AUC = 0 correspond to perfect and perfectly-wrong ranking, respectively. Unlike the error rate (or, conversely, the accuracy), the AUC does not depend on the class balance of the dataset, hence it can be meaningfully compared across different datasets.
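To make Eqn. 4.1 concrete, the following minimal R sketch computes the sample AUC from a vector of class labels and a vector of continuous predictions; the function and variable names are illustrative only and are not taken from the software used in this chapter.

    # Estimate the AUC of Eqn. 4.1: average over all (positive, negative) pairs,
    # counting correctly ordered pairs and giving half credit to ties.
    auc <- function(y, yhat) {
      pos <- yhat[y == +1]   # predictions for the N+ positive samples
      neg <- yhat[y == -1]   # predictions for the N- negative samples
      mean(outer(pos, neg, function(a, b) (a > b) + 0.5 * (a == b)))
    }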


[Table 4.1: columns Dataset, Samples (original and removed), Lymph, Therapy, Age (years), Grade, and ER status, for GSE2034, GSE4922, GSE6532, GSE7390, and GSE11121; the individual values are not reproduced here.]


4.2.3. Classifiers

Feature instability, manifested as discordant gene lists, can be caused by inherent instability (genes that truly have high variance between different samples), by overfitting of the classifier to the data, and by redundancies in the data, such as perfect correlation between features. This is especially the case in gene expression data, where the number of features p far outweighs the number of samples N, and the underlying measurements are known to be noisy. In such scenarios, there is always the possibility of overfitting: the situation where a model performs well or even perfectly on the training data, but performs worse, or even no better than random, on independent testing data. Therefore, to reduce the risk of overfitting, we use the centroid classifier (Schölkopf and Smola, 2002). The centroid classifier is equivalent to a heavily-regularised support-vector machine (Bedo et al., 2006) and to Fisher Linear Discriminant Analysis (LDA) with diagonal covariance and uniform priors (Dabney and Storey, 2007; Tibshirani et al., 2003). The centroid classifier implements a model with strong assumptions about the data: it does not account for the variance of each gene, or for the fact that genes are correlated, and the weight estimated for each gene is independent of the weights estimated for all other genes, unlike, for example, in logistic regression or an SVM.

In practical terms, we expect the centroid classifier to be less prone to overfitting than an SVM or similar classifier. We further stabilise the centroid's estimates by averaging them over random subsamples of the data. Despite its simplicity, the centroid classifier performs well in microarray studies (Bedo et al., 2006), where commonly the number of features is much greater than the number of samples (p ≫ N) and there is significant noise. For the centroid classifier, we observed discrimination similar to or better than several other classifiers, including SVMs, nearest shrunken centroids (Tibshirani et al., 2003), and the van 't Veer classifier.

The centroid classifier finds the centroid of each class across the p features, that is, the p-vector of average gene expression in each class. New observations are classified by comparing their expression vector with the two centroids and choosing the closest centroid. Given a p × N matrix Z = [z_{ji}], 1 ≤ j ≤ p, 1 ≤ i ≤ N, the p-vector centroids of the positive and negative classes are, respectively,

c_+ = \frac{1}{N_+} \sum_{\{i \mid y_i = +1\}} z_i, \qquad c_- = \frac{1}{N_-} \sum_{\{i \mid y_i = -1\}} z_i,

where N_+ + N_- = N are the numbers of samples in the positive and negative classes, respectively, and z_i is the ith expression vector of p features (the ith column of Z, one sample). The centroid classifier predicts using the inner product rule

\hat{y}_i = \langle z_i - c, w \rangle,

where \langle x, y \rangle = \sum_{j=1}^{p} x_j y_j is the inner (dot) product, c = (c_+ + c_-)/2 is the point midway between the centroids, and the feature weights w are the p-vector connecting the two centroids,

w = c_+ - c_-. \qquad (4.2)

The sign of \hat{y}_i is then the predicted class.
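A minimal R sketch of the centroid classifier defined above; Z is a p × N training matrix, y a vector of labels in {−1, +1}, and the function names are ours, chosen for illustration rather than taken from the thesis code.

    centroid_train <- function(Z, y) {
      c_pos <- rowMeans(Z[, y == +1, drop = FALSE])   # centroid of the positive class
      c_neg <- rowMeans(Z[, y == -1, drop = FALSE])   # centroid of the negative class
      list(w = c_pos - c_neg,                         # feature weights, Eqn. 4.2
           c = (c_pos + c_neg) / 2)                   # midpoint between the centroids
    }

    centroid_predict <- function(fit, Znew) {
      # inner-product rule: yhat_i = <z_i - c, w>; the sign gives the predicted class
      drop(crossprod(Znew - fit$c, fit$w))
    }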


For the calculation of the area under the receiver operating characteristic curve (AUC), we use \hat{y}_i itself as the prediction, since it produces AUC estimates with lower variance than does the binary class prediction sign(\hat{y}_i): ties in the ROC calculation are more likely for discrete predictions in {−1, +1} than for continuous predictions, which manifests as jagged ROC curves and, equivalently, as AUC estimates with higher variance.

Note that the centroid classifier used here is similar but not identical to the classifier used by van 't Veer et al. (2002); they assigned each sample to the class whose centroid had the highest Pearson correlation with the sample. This is equivalent to our version of the centroid classifier when the samples are scaled to unit norm (McLachlan et al., 2004).

The centroid classifier requires no tuning since it has no hyperparameters, making it fast to compute. Recursive feature elimination (RFE) (Guyon et al., 2002) can be used to train classifiers with different numbers of features, in order to potentially find parsimonious models with the best predictive features. In RFE, a full model (all features) is first trained on the data. Then one or more features are dropped from the model, based on the absolute value of their weights (smallest weights first), and the model is re-trained using the reduced set of features. The process continues until there are no features left in the model. For the centroid classifier, RFE is especially simple since features can simply be eliminated in reverse order of their absolute weights, and the model does not need to be re-trained each time, because the weights of the features are independent of each other.
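Because the centroid weights do not change when features are removed, RFE for the centroid classifier amounts to a single sort of the features by absolute weight. A hedged sketch, reusing the illustrative centroid_train function above:

    # Indices of the top k features by absolute centroid weight; for this classifier,
    # equivalent to running RFE down to k features, since the remaining weights never change.
    rfe_centroid <- function(Z, y, k) {
      fit <- centroid_train(Z, y)
      order(abs(fit$w), decreasing = TRUE)[seq_len(k)]
    }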
4.2.4. Internal versus External Validation

Since we have five datasets, it might seem reasonable to combine them into one. However, we were interested in measuring the concordance between datasets, rather than performing a meta-analysis. The inter-dataset analysis emulates the real-world situation where different studies are performed separately rather than pooled together. Therefore, we distinguish between internal and external validation. In the former, we estimate the classifier's generalisation within each dataset, using repeated random subsampling; the subsampling is then used to form a bagged classifier for each dataset (described in Section 4.2.1). Bagging refers to bootstrap aggregation (Hastie et al., 2009a), a procedure that involves training multiple separate classifiers on random samples of the data and building a final classifier by averaging the individual classifiers (either their model weights or their predictions). The random samples can be chosen with replacement, as in the bootstrap procedure (Efron and Tibshirani, 1993), or without replacement, as in cross-validation. Bagging reduces the variance of the predictions without increasing the bias (Hastie et al., 2009a).

We then perform external validation, where the bagged classifier from each dataset is used to predict the metastatic class of patients from another dataset. This is a more realistic estimate of the classifier's discriminative ability.

For internal validation, we used repeated random subsampling to estimate the classifier's internal generalisation error, as measured by AUC. In this approach, the dataset is randomly split B times into training and testing parts (2/3 and 1/3 of the data, respectively). We used B = 25 splits. Repeated subsampling with a 2/3–1/3 split is similar to the 0.632 bootstrap without replacement (Binder and Schumacher, 2008). Each split results in one model; the predictions from the B models are then combined into one bagged prediction by averaging over the B predictions and using that vector of averages as the final prediction.
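A sketch of the repeated-subsampling scheme just described, again assuming the illustrative centroid_train and centroid_predict functions from Section 4.2.3; in this simplified version the bagged prediction for each sample is the average over the splits in which that sample was held out.

    # Bagged internal validation: B random 2/3-1/3 splits of the N samples.
    bagged_centroid <- function(Z, y, B = 25) {
      N <- ncol(Z)
      preds <- matrix(NA_real_, nrow = N, ncol = B)
      for (b in seq_len(B)) {
        train <- sample(N, size = round(2 * N / 3))       # 2/3 training, without replacement
        fit <- centroid_train(Z[, train, drop = FALSE], y[train])
        preds[-train, b] <- centroid_predict(fit, Z[, -train, drop = FALSE])
      }
      rowMeans(preds, na.rm = TRUE)   # average over the splits in which each sample was held out
    }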


4.2.5. Molecular Signatures Database Gene Sets

We used five gene set collections, totalling 5452 gene sets, from the Molecular Signatures Database (MSigDB) v2.5 (http://broadinstitute.org/gsea/msigdb):
• C1: 386 positional gene sets, defined for each human chromosome and cytogenetic band that has at least one gene. These sets represent expression effects associated with chromosomal amplifications and deletions, dosage compensation, and epigenetic silencing.
• C2: 1892 curated gene sets, collected from annotated sources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000; Kanehisa et al., 2010), PubMed publications, and several pathway databases such as BioCarta (http://www.biocarta.com) and Reactome (http://www.reactome.org) (Matthews et al., 2009).
• C3: 837 motif gene sets, which are gene sets that share cis-regulatory motifs conserved across humans, mice, rats, and dogs (Xie et al., 2005).
• C4: 883 computational gene sets, derived from data mining of large collections of cancer-related gene expression data (Brentani et al., 2003; Segal et al., 2003).
• C5: 1454 Gene Ontology gene sets, derived from the Gene Ontology database (Ashburner et al., 2000). Note that GO gene sets are not necessarily co-regulated genes.
Within and between these collections, the gene sets may overlap.

4.2.6. Gene Set Statistics

The purpose of a set statistic is to reduce the set's expression matrix to a single vector, which is then used as a feature for classification. The intention is for the set statistic to represent the expression levels of the set in a useful way. Here we describe the different set statistics used. All of our set statistics are unsupervised, in the sense that they do not take the metastatic class into account, unlike, for example, the t-test set statistic (Tian et al., 2005), GSEA (Subramanian et al., 2005), or GSA (Efron and Tibshirani, 2007). Any standard classifier, such as a support vector machine (SVM), can be employed by using these set statistics as features. The gene set statistics used here are summarised in Table 4.2.


[Figure 4.2: schematic of X, the p × N matrix of gene expression, with three gene sets S_1, S_2, S_3 marked on its rows, reduced to Z, the M × N matrix of gene set statistics.]
Figure 4.2.: Schematic of how the gene set features are constructed from three gene sets S_1 (red), S_2 (green), and S_3 (blue), with 2, 3, and 1 gene(s), respectively. Note that for clarity we show non-overlapping sets, although the sets can overlap in practice.

Notation

Here, X = [x_{ki}], 1 ≤ k ≤ p, 1 ≤ i ≤ N, is the p × N matrix of gene expression levels, where N is the number of samples and p is the number of genes. The ith column (a p-vector) of X is denoted x_i. Every gene belongs to one or more gene sets S_j, such that S_j ⊂ {1, ..., p}, for j = 1, ..., M, where M is the number of gene sets. The cardinality of the jth set (the number of genes in the set) is denoted s_j = |S_j|. We use X_{S_j} to denote the s_j × N submatrix of X that corresponds to the jth gene set. The construction of the resulting M × N matrix of gene set statistics Z = [z_{ij}] is shown in Figure 4.2.

Set Centroid and Set Median

The centroid of a set S_j is the mean expression level over all genes in the set. The matrix of all centroids is an M × N matrix whose rows (all samples for one gene set) are

c_j = \frac{1}{|S_j|} \sum_{k \in S_j} x_k \in \mathbb{R}^N, \qquad j = 1, \ldots, M, \qquad (4.3)

where x_k = (x_{k1}, ..., x_{kN})^T is the expression vector of the kth gene. Similarly, the set median is the median expression level over all genes in the set for a given sample.

The motivation for the set centroid is that it reduces the variance in each feature, since the sample variance of the mean of n samples of a random variable X is the square of the standard error of the sample mean \bar{x}.


The actual decrease in variance depends on the degree of correlation between the variables. Another interpretation is that all the genes in the same set are shrunk towards the mean, thereby reducing the effect of outlier genes and reducing potential overfitting.

The set median for the jth set is m_j = (m_{j1}, ..., m_{jN})^T, defined as

m_{ji} = \begin{cases} R_{j,(s_j+1)/2} & \text{for odd } s_j, \\ \left(R_{j,s_j/2} + R_{j,s_j/2+1}\right)/2 & \text{for even } s_j, \end{cases} \qquad (4.4)

where R_j = (R_{j1}, ..., R_{j s_j})^T is the sorted vector of gene expression values for the jth set in the ith sample, and s_j = |S_j| is the number of genes in the jth set. The set median is less sensitive to outliers than the set centroid.

Set Medoid

The medoid of a set S_j is defined, for each sample i, as the expression value of the gene in S_j closest in Euclidean distance to the centroid,

m_{ji} = x_{k_i^* i}, \quad \text{where } k_i^* = \arg\min_{k \in S_j} (x_{ki} - c_{ji})^2, \qquad (4.5)

where x_{ki} is the expression level of the kth gene (out of the |S_j| genes in the set) in the ith sample. Therefore, m_j = (m_{j1}, ..., m_{jN})^T is the vector of medoid values for the jth gene set. Note that this is an element-wise medoid, computed for each sample separately, and is not the same as an overall medoid over all samples,

m'_j = x_{k^*}, \quad \text{where } k^* = \arg\min_{k \in S_j} \|x_k - c_j\|_2^2. \qquad (4.6)

The medoid in Eqn. 4.6 is computed over all samples and is therefore the expression vector of one of the genes in the set, whereas for the sample-wise medoid (Eqn. 4.5) each element of the vector may originate from a different gene. The medoid over all samples in Eqn. 4.6 is denoted set medoid2 in the results.
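The three location-based set statistics above (Eqns. 4.3 to 4.5) can be sketched in a few lines of R; here X is the p × N expression matrix, S is a vector of row indices for one gene set, and the function names are illustrative only.

    set_centroid <- function(X, S) colMeans(X[S, , drop = FALSE])          # Eqn. 4.3
    set_median   <- function(X, S) apply(X[S, , drop = FALSE], 2, median)  # Eqn. 4.4

    # Element-wise medoid (Eqn. 4.5): for each sample, the expression value of the
    # set gene that is closest to the set centroid in that sample.
    set_medoid <- function(X, S) {
      sub <- X[S, , drop = FALSE]
      cj  <- colMeans(sub)
      sapply(seq_len(ncol(sub)), function(i) sub[which.min((sub[, i] - cj[i])^2), i])
    }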


Set t-statistic

The set centroid does not take into account different means and variances between the genes, nor the fact that a gene may have a large mean but also high variance (a low signal-to-noise ratio). An alternative is to use the one-sample t-statistic. The matrix of t-statistics is computed by first centring and scaling the expression matrix so that each gene has mean zero and unit variance, and then computing the t-statistic for each set in each sample,

t_{ij} = c_{ij} \sqrt{|S_j|} / \mathrm{sd}_{ij}, \qquad (4.7)

where c_{ij} is the ith coordinate of the jth centroid statistic from Eqn. 4.3, c_j = [c_{ij}]_{1 \le i \le N}, and sd_{ij} is the standard deviation of the genes in set j in the ith sample. Scaling is done to prevent spurious t-statistics, arising from very small variances, from inflating the importance of "non-interesting" genes. We also excluded sets with fewer than 30 genes for the same reason.

Set U-statistic

The competitive U-statistic for the set, also known as Wilcoxon's rank-sum statistic (Lehmann, 1975), compares the mean rank of the genes in the set with the mean rank of the genes outside the set, for each sample. We define s_j = |S_j| and s_{\neg j} = |\bigcup_{l=1}^{M} \{x \in S_l \mid x \notin S_j\}| as the numbers of genes in and out of the jth set, respectively (note that gene sets may overlap). The U-statistic is computed as follows:
1. Create the list of gene expression ranks L_i = l_{i1}, l_{i2}, ..., l_{i s_j} of the s_j genes in the set in the ith sample.
2. Sum the ranks for the set: R_{ij} = \sum_{k=1}^{s_j} l_{ik}.
3. The U-statistic for set S_j in sample i is then

U_{ij} = R_{ij} - s_j (s_j + 1)/2. \qquad (4.8)

For large numbers of genes in and out of the set, the U-statistic is approximately normally distributed with mean \mu = s_j s_{\neg j}/2 and variance \sigma^2 = s_j s_{\neg j} (s_j + s_{\neg j} + 1)/12. Once the U-statistic is computed, we use the log p-value from this normal approximation as the feature for the classifier.

The U-statistic is slightly unusual in that it pits gene sets against each other: the distribution depends on the number of genes rather than the number of samples. Goeman and Bühlmann (2007) argue that this statistic is inappropriate since it switches the standard relationship between genes and samples in the experimental setup (the sample size becomes the number of genes, not the number of microarrays); however, Barry et al. (2008) consider it a useful statistic nonetheless. In any case, we use this statistic only as a feature for a classifier, and not for making inferences about the statistical significance of the sets' expression levels.
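The set t-statistic (Eqn. 4.7) and the log p-value of the set U-statistic (Eqn. 4.8, under the normal approximation) can be sketched as follows. To keep the sketch self-contained, the number of genes outside the set is simplified to all remaining rows of X rather than the union over the other sets, the upper-tail p-value is used, and all names are illustrative.

    set_t_stat <- function(X, S) {
      Xs  <- t(scale(t(X)))                    # centre and scale each gene (row)
      sub <- Xs[S, , drop = FALSE]
      colMeans(sub) * sqrt(length(S)) / apply(sub, 2, sd)   # Eqn. 4.7, one value per sample
    }

    set_u_stat_logp <- function(X, S) {
      rk    <- apply(X, 2, rank)               # rank all genes within each sample
      s_in  <- length(S)
      s_out <- nrow(X) - s_in                  # simplification of s_{not j}
      U     <- colSums(rk[S, , drop = FALSE]) - s_in * (s_in + 1) / 2   # Eqn. 4.8
      mu    <- s_in * s_out / 2
      sigma <- sqrt(s_in * s_out * (s_in + s_out + 1) / 12)
      pnorm(U, mean = mu, sd = sigma, lower.tail = FALSE, log.p = TRUE) # log p-value feature
    }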


Statistic and equation: set centroid (4.3); set median (4.4); set medoid (4.5), (4.6); set t-statistic (4.7); set U-statistic (4.8); first principal component of the set (4.9).
Table 4.2.: The gene set statistics used in this work.

First Principal Component of the Set

Principal Component Analysis (PCA) (Hastie et al., 2009a; Ramsay and Silverman, 2006) is performed using the singular value decomposition (SVD) of the gene set's |S_j| × N expression matrix X_{S_j} = [x_{ki}]_{k \in S_j, 1 \le i \le N}, defined as

X_{S_j} = V_{S_j} D_{S_j} U_{S_j}^T,

where V_{S_j} and U_{S_j} are matrices whose columns are the left and right singular vectors, respectively, and D_{S_j} is a diagonal matrix whose entries are the singular values (the diagonal of D_{S_j}^2 contains the eigenvalues of X_{S_j} X_{S_j}^T). The first eigenvector v_1 (the first column of V_{S_j}) explains the largest amount of variance in X_{S_j}. The first principal component PC1_j ∈ R^N of the X_{S_j} matrix is obtained by projecting the data onto that eigenvector,

\mathrm{PC1}_j = v_1^T X_{S_j}, \qquad (4.9)

where v_1 is an s_j-vector. Hence, PC1 is the best rank-1 approximation of the data. We mean-centred and scaled the matrix X_{S_j} to unit variance, gene by gene, before applying PCA, in order to put all genes on the same scale, reducing the effect of genes with larger than usual variance on the singular vectors.

One possible problem with PCA is axis reflection (Mehlman et al., 1995). Since the eigenvectors of the covariance matrix are determined only up to a multiplicative constant, different numerical implementations of PCA may produce eigenvectors of opposite signs. Furthermore, even small perturbations of the same data (such as through bootstrap replications) may yield flipped signs. This effect is amplified in the presence of noise. When the components are used as features for classification or regression, flipped signs result in flipped estimates of the model parameters. Since there are usually differences between datasets, axis reflection especially manifests itself as negative correlation between the eigenvectors derived from different datasets. Eigenvectors from different datasets may also point in opposite directions because a gene set may change the sign of its correlation with the phenotype under different experimental conditions; we assume this is not the case here.

To mitigate the effects of axis reflection, the sign of the eigenvectors must be (arbitrarily) fixed. Since we are interested only in the first principal component, a simple heuristic solution is to fix the sign of the first eigenvector, by flipping the sign of its elements if the majority of the signs are negative,

v'_1 = v_1 \, \mathrm{sign} \langle v_1, \mathbf{1} \rangle,

where \mathbf{1} = (1, ..., 1)^T is a vector of ones of the same length as v_1.
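A short R sketch of the first-principal-component statistic (Eqn. 4.9) together with the sign-fixing heuristic just described; X_S is the |S_j| × N submatrix for one set, and the function name is illustrative.

    set_pc1 <- function(X_S) {
      Xs <- t(scale(t(X_S)))              # gene-wise centring and scaling
      v1 <- svd(Xs)$u[, 1]                # first left singular vector (length |S_j|)
      if (sum(v1) < 0) v1 <- -v1          # fix the sign so that <v1, 1> >= 0
      drop(crossprod(v1, Xs))             # PC1_j = v1^T X_Sj  (Eqn. 4.9)
    }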


Note that, even after this correction, axis reflections might still occur, since some elements of the eigenvector may still flip their sign if they are close to zero. A second caveat with PCA is that although it finds the principal component that explains the most variance in the predictor variables, this component may or may not explain the variance in the response variable. A third and final caveat is that although PCA is intended to reduce the effects of noise on the data, it can itself be sensitive to noise and outliers. For example, while most of the data may lie along one direction (suggesting this direction as a good principal component), adding a few large outliers orthogonal to this direction may result in a different (orthogonal) principal component being chosen.

Other PCA variants have been proposed, for example smoothed or penalised principal components (Ramsay and Silverman, 2006, Ch. 9) and supervised PCA (Bair and Tibshirani, 2004). We have not implemented these here, since standard PCA is more common in the literature and the more sophisticated methods require further tuning.

4.3. Results

We conducted experiments aimed at evaluating the utility of classifiers based on gene set statistics, relative to the individual-gene approach. First, we assessed the ability to discriminate good from poor prognosis in the breast cancer datasets. Second, we investigated the prognostic lists derived from these statistics, in comparison with lists derived from individual genes, and measured their stability across random perturbations of the data. Third, we mapped the prognostic lists to known biological pathways in order to find those pathways most strongly associated with metastasis, both in the data as a whole and in subsets of the data defined by breast cancer molecular subtypes.

4.3.1. Discrimination of Distant Metastasis

Figure 4.3 shows the AUC for external validation, trained on one dataset and tested on another, a total of 2 × \binom{5}{2} = 20 predictions (the procedure is not symmetric), using centroid classifiers trained on different numbers of features. The maximum number of features is 22,215 for genes and 5414 for gene sets. For clarity, we only show the results for classifiers based on the expression of individual genes (denoted "raw"), the set centroid, the set median, and the set t-statistic. Unlike classifiers such as logistic regression or SVMs, the centroid classifier's weight for one feature does not depend on the others. While it is known that genes are not independently expressed, this strong assumption does not appear to reduce classification accuracy in our data. The best AUC of about 0.7 is consistent with previous results based on either lists of individual genes (van 't Veer et al., 2002; Wang et al., 2005) or gene sets (van Vliet et al., 2007).


[Figure 4.3: AUC (y-axis, roughly 0.60 to 0.70) against the number of features (x-axis, 1 to 22,215) for the raw genes and the set centroid, set median, set medoid, set medoid2, and set t-statistic features.]
Figure 4.3.: Average and 95% confidence intervals for AUC from external validation between the five datasets, n = 2 × \binom{5}{2} = 20 (train, test) pairs, for different numbers of features. We show only every second confidence interval for clarity. Note that each dataset ranks its features independently; hence, the kth feature is not necessarily the same across datasets. Individual genes are denoted raw.


The set centroid, both set medoids, the set median, and the set t-statistic showed similar or just slightly lower AUC than that of individual genes. The set PC and set U-statistic showed statistically significant reductions in AUC compared with individual genes (Figure A.6).

While the AUC does not seem to improve, on average, by using set statistics rather than individual genes, Figure 4.4 shows that the variance of the AUC is lower for the set t-statistic than for individual genes. This observation is consistent with the expectation that creating new features by averaging over individual genes reduces the variance of the input and consequently that of the classifier's prediction.

Further, we observed that the discrimination of good/poor metastasis outcome from the gene set statistics was similar in the internal and external validation, indicating that the centroid classifier did not significantly over- or under-fit the data, and that the AUC estimates from internal validation, which are easier to obtain since only one dataset is required, are representative of those from external validation, which is more complicated since at least two datasets are required.

4.3.2. Stability of Feature Ranks

We were interested in how the ranks of a single feature vary, since we prefer features that are highly ranked on average and have small variability about that average. If a feature has a low average rank and large variability, it may sometimes appear at the top of the list simply by chance when the experiment is repeated, indicating that it is not a reliable predictor. On the other hand, features with a high average rank and large variability may still be good predictors on average but will create unstable gene lists, manifesting as different datasets producing different gene lists of similar predictive ability.

To evaluate the variability of the ranks, we used the percentile bootstrap to sample the observations with replacement, generating a bootstrap distribution for the centroid weights of genes and gene sets in one dataset (GSE4922). Since there are 22,215 genes and only 5414 gene sets, a reduced gene list was produced by training a centroid classifier on the GSE11121 dataset and selecting the top 5414 genes based on their absolute centroid weights |w_j| (Eqn. 4.2); the gene list was fixed across the bootstrap replications.

In many cases we are interested in a small signature comprised of the most useful or predictive features. Therefore, we selected the top 15 genes and gene sets based on their mean rank. Figure 4.5 shows the mean, 2.5%, and 97.5% percentiles from 5000 bootstrap replications for these top features (shown from highest to lowest), using the set centroid statistic. It is clear that the top gene sets have lower variation than the top genes. In light of these results, it is not surprising that lists of prognostic genes show little overlap, as even the best-ranked genes vary considerably within the same dataset, let alone between datasets; gene-set features are more stable.
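A sketch of the bootstrap procedure behind Figure 4.5, assuming the illustrative centroid_train function from Section 4.2.3; Z is the feature-by-sample matrix (genes or set statistics) of one dataset, and the names are ours.

    # Bootstrap the centroid weights and record each feature's rank in every replicate.
    bootstrap_ranks <- function(Z, y, B = 5000) {
      ranks <- replicate(B, {
        idx <- sample(ncol(Z), replace = TRUE)                # percentile bootstrap resample
        fit <- centroid_train(Z[, idx, drop = FALSE], y[idx])
        rank(-abs(fit$w))                                     # rank 1 = largest |w|
      })
      # mean rank and 2.5%/97.5% percentiles per feature, as plotted in Figure 4.5
      data.frame(mean  = rowMeans(ranks),
                 lower = apply(ranks, 1, quantile, probs = 0.025),
                 upper = apply(ranks, 1, quantile, probs = 0.975))
    }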


[Figure 4.4: Var(AUC) (y-axis, roughly 0.002 to 0.012) against the number of features (x-axis, 1 to 22,215) for the raw genes and the set centroid, set median, set medoid, set medoid2, and set t-statistic features.]
Figure 4.4.: Variance and 95% confidence intervals of the AUC from external validation between the five datasets, n = 2 × \binom{5}{2} = 20 (train, test) pairs, for different numbers of features. The confidence intervals are [(n−1)s²/χ²_{α/2,n−1}, (n−1)s²/χ²_{1−α/2,n−1}], where χ²_α is the α = 0.05 quantile of a chi-squared distribution with n − 1 degrees of freedom, and s² is the sample variance.


[Figure 4.5: bootstrap rank distributions (ranks 0 to about 2000) for the top 15 genes (including MYBL2, SORL1, DUS2L, RP6-213H19.1, NCAPH, RARRES1, CD302, SLC44A4, NCAPG, BUB1B, TFRC, C4orf18, FAM38A, POLQ, and SNRPA1) and the top 15 gene sets (including chr4p, DAC_FIBRO_DN, GNF2_MKI67, chr1q11, MIDDLEAGE_DN, GNF2_CCNA2, GNF2_HMMR, GNF2_CDC20, GNF2_PCNA, GNF2_RRM2, GNF2_TTK, GNF2_H2AFX, ZHAN_MM_CD138_PR_VS_REST, P21_EARLY_DN, and GNF2_CDC2).]
Figure 4.5.: Mean and 2.5%/97.5% percentiles of the ranks of genes and gene sets. Ranks are based on the weight assigned by the centroid classifier to each feature. For gene sets, we used the set centroid statistic. The process was repeated over 5000 bootstrap replications of the GSE4922 dataset. Features have been sorted by their mean rank.


[Figure 4.6: correlation values (roughly 0.0 to 0.6) for the raw genes and each set statistic (set centroid, set median, set medoid, set PC, set t-statistic, and set U-statistic log p-value).]
Figure 4.6.: Spearman rank-correlation of the centroid classifier's weights from the five datasets (n = 10 comparisons). Individual genes are denoted raw.

4.3.3. Concordance between Datasets

We were interested in measuring how the different datasets agreed on the importance of the features (genes or gene sets). We used two approaches: rank correlation of the centroid classifier's weights, and concordance of the feature lists. For this section, the classifier was not bagged; we trained a single classifier on each dataset. We note that each dataset was independently normalised, and we are interested in agreement between datasets despite the fact that there may be some differences between them that were not captured by the predictive model, such as unknown batch effects or other confounders. A high level of agreement between independent models is a strong indication that the predictive ability of the models is due to true biological signal and not due to confounding.

We measured concordance between the classifier weights estimated from each dataset using Spearman rank-correlation, a total of \binom{5}{2} = 10 comparisons (comparisons are symmetric), as shown in Figure 4.6. It is evident that the rank-correlations for the weights of the set centroid, set median, set medoid, and set t-statistic are higher than for individual genes. This indicates that classifiers built from features based on gene sets are more stable than those built using individual genes, and are less likely to overfit. The set U-statistic showed the lowest concordance of all measures considered, including individual genes.
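The between-dataset concordance shown in Figure 4.6 can be computed along the following lines; W is assumed to be a list of the five weight vectors, one per dataset, over a common set of features, and the function name is illustrative.

    # All 10 pairwise Spearman rank-correlations between the datasets' weight vectors.
    pairwise_spearman <- function(W) {
      pairs <- combn(length(W), 2)
      apply(pairs, 2, function(p) cor(W[[p[1]]], W[[p[2]]], method = "spearman"))
    }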


To measure how the ranked lists produced by each dataset agreed on the top-ranked genes, we used the following approach. The features of each dataset were ranked by the absolute value of their weight w. However, list lengths can affect the apparent concordance between datasets in a subtle way: lists with smaller numbers of items to rank will achieve higher concordance on average than longer lists, simply by virtue of having fewer items to choose from. Therefore, we ensured that the number of items being ranked was identical across all tasks: we used the 4120 gene sets selected by the set t-statistic as the basis for the lists of all other gene set statistics, and for the genes we selected 4120 genes in one dataset and used the same list across the other four gene datasets. Then, for each number of features f, f = 1, ..., p, we chose each dataset's top f ranked features. Next, we counted how many of these f features occurred in at least k of the five datasets. Results for k = 5 are shown in Figure 4.7. Lists based on individual genes show no overlap whatsoever for cutoffs up to about 70 (there are no genes that occur in all five lists of length 70) and very low overlap even at 200 genes. In comparison, the set statistics, especially the set medians and the set centroids, produce lists with higher overlap, even at cutoffs below f = 10. This result further supports the conclusion that lists of individual prognostic genes are highly unstable, even when developed on the same dataset (Ein-Dor et al., 2005; Michiels et al., 2005). In contrast, aggregation into set-based features greatly increases the stability and hence the interpretability of the results.
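A sketch of the list-overlap computation underlying Figure 4.7; ranked_lists is assumed to be a list of five named integer vectors giving each dataset's rank for the same items, and the names are illustrative.

    # For each cutoff f, count how many features appear in the top-f list of at least k datasets.
    overlap_counts <- function(ranked_lists, f_max = 200, k = 5) {
      sapply(seq_len(f_max), function(f) {
        top <- lapply(ranked_lists, function(r) names(r)[r <= f])   # items ranked above the cutoff
        sum(table(unlist(top)) >= k)
      })
    }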


4.3.4. Analysis of Predictive MSigDB Sets

Here we analyse which MSigDB sets were highly predictive of distant metastasis, using the centroid classifier, and examine which genes are over-represented in these sets. Table 4.3 shows the top 10 gene sets by rank, where the rank was averaged over the feature ranks from the five datasets, using the set centroid statistic. Also shown is the enrichment for GO Biological Process (BP) terms, from a Bonferroni-adjusted Fisher's exact test, for the genes belonging to these sets. The top sets are enriched for GO BP terms related to the cell cycle and cell division processes, and for the PI3K pathway, which interacts with the Ras oncogene (Downward, 2003), confirming the cell cycle as one of the major biological mechanisms associated with breast cancer metastasis (Dai et al., 2005; Mosley and Keri, 2008). The top gene set, GNF2_MKI67, represents the neighbourhood of Ki-67, a well-known clinical marker of cancer proliferation (van Diest et al., 2004).

Table 4.3 entries (rank; set; MSigDB category; sign; MSigDB description; enriched GO BP terms with adjusted p-values):
1. GNF2_MKI67 (C4, sign −1), neighborhood of MKI67: "phosphoinositide-mediated signaling" 1.95×10^-10, "spindle organization" 5.86×10^-6, "establishment of mitotic spindle localization" 1.10×10^-5, "kinetochore assembly" 5.48×10^-5, "mitotic chromosome condensation" 1.37×10^-4, "protein complex localization" 2.55×10^-3, "regulation of striated muscle development" 2.55×10^-3, "metaphase plate congression" 2.55×10^-3.
2. GNF2_CCNA2 (C4, sign −1), neighborhood of CCNA2: "phosphoinositide-mediated signaling" 4.05×10^-16, "DNA replication" 1.04×10^-9, "mitotic chromosome condensation" 1.32×10^-8, "regulation of striated muscle development" 3.76×10^-3, "metaphase plate congression" 3.76×10^-3.
3. GNF2_TTK (C4, sign −1), neighborhood of TTK: "phosphoinositide-mediated signaling" < 2.22×10^-16, "mitotic chromosome condensation" 4.35×10^-14, "DNA replication" 1.01×10^-12, "spindle organization" 1.37×10^-9, "establishment of mitotic spindle localization" 9.59×10^-5, "kinetochore assembly" 4.76×10^-4, "DNA repair" 5.78×10^-3, "mitosis" 9.44×10^-3.
4. GNF2_HMMR (C4, sign −1), neighborhood of HMMR: "phosphoinositide-mediated signaling" < 2.22×10^-16, "mitotic cell cycle spindle assembly checkpoint" 1.26×10^-11, "spindle organization" 4.89×10^-10, "mitotic chromosome condensation" 8.46×10^-8, "cell proliferation" 6.22×10^-6, "DNA replication" 1.09×10^-5, "establishment of mitotic spindle localization" 5.33×10^-5, "kinetochore assembly" 2.65×10^-4, "protein complex localization" 8.29×10^-3, "regulation of striated muscle development" 8.29×10^-3, "metaphase plate congression" 8.29×10^-3.
5. GNF2_CDC20 (C4, sign −1), neighborhood of CDC20: "phosphoinositide-mediated signaling" < 2.22×10^-16, "spindle organization" 2.20×10^-12, "mitotic cell cycle spindle assembly checkpoint" 4.07×10^-11, "mitotic chromosome condensation" 1.52×10^-9, "cell proliferation" 8.96×10^-9, "mitosis" 1.83×10^-8, "establishment of mitotic spindle localization" 8.95×10^-5, "kinetochore assembly" 4.45×10^-4, "DNA replication" 7.83×10^-3.


6. GNF2_SMC2L1 (C4, sign −1), neighborhood of SMC2L1: "mitotic cell cycle spindle assembly checkpoint" 5.15×10^-13, "mitotic chromosome condensation" 7.16×10^-9, "phosphoinositide-mediated signaling" 2.14×10^-6, "establishment of mitotic spindle localization" 1.31×10^-5, "kinetochore assembly" 6.51×10^-5, "protein complex localization" 2.90×10^-3, "DNA strand elongation during DNA replication" 2.90×10^-3, "regulation of striated muscle development" 2.90×10^-3, "metaphase plate congression" 2.90×10^-3, "cell proliferation" 2.94×10^-3, "nucleotide-excision repair, DNA gap filling" 3.56×10^-3.
7. GNF2_H2AFX (C4, sign −1), neighborhood of H2AFX: "cell proliferation" 9.28×10^-10, "phosphoinositide-mediated signaling" 5.54×10^-7, "mitosis" 8.48×10^-5, "mitotic cell cycle spindle assembly checkpoint" 1.33×10^-4, "protein complex localization" 1.63×10^-3.
8. GNF2_ESPL1 (C4, sign −1), neighborhood of ESPL1: "phosphoinositide-mediated signaling" 5.38×10^-11, "kinetochore assembly" 3.12×10^-5, "mitotic chromosome condensation" 6.75×10^-5, "spindle organization" 7.76×10^-4, "protein complex localization" 1.67×10^-3, "regulation of striated muscle development" 1.67×10^-3, "metaphase plate congression" 1.67×10^-3.
9. GNF2_RRM2 (C4, sign −1), neighborhood of RRM2: "phosphoinositide-mediated signaling" 4.52×10^-15, "mitotic cell cycle spindle assembly checkpoint" 1.17×10^-9, "spindle organization" 1.20×10^-7, "DNA replication" 5.42×10^-6, "cell proliferation" 1.97×10^-5, "establishment of mitotic spindle localization" 4.09×10^-5, "kinetochore assembly" 2.03×10^-4, "protein complex localization" 6.80×10^-3, "regulation of striated muscle development" 6.80×10^-3, "metaphase plate congression" 6.80×10^-3.
10. GNF2_PCNA (C4, sign −1), neighborhood of PCNA: "phosphoinositide-mediated signaling" < 2.22×10^-16, "DNA replication" 1.47×10^-15, "mitotic chromosome condensation" 2.36×10^-7, "spindle organization" 4.33×10^-7, "establishment of mitotic spindle localization" 9.59×10^-5, "cell proliferation" 4.18×10^-4, "DNA repair" 4.33×10^-4, "kinetochore assembly" 4.76×10^-4, "mitosis" 9.44×10^-3.
Table 4.3.: Top 10 gene sets by average rank over the five datasets, using the set centroid statistic. GO enrichment p-values are from a Bonferroni-adjusted one-sided Fisher's exact test (30,330 tests). Sign = −1 if expression is negatively associated with long-term survival, and +1 otherwise. The background list for the test includes all Affymetrix HG-U133A probesets that could be mapped to GO BP terms, excluding IEA annotations.


[Figure 4.7: "Features in ≥ 5 datasets"; number of common features (y-axis, 0 to 100) against the threshold f (x-axis, 0 to 200) for the raw genes and each set statistic.]
Figure 4.7.: Concordance of feature lists (genes or gene sets) for different cutoffs f = 1, ..., 200, counting the number of features occurring in all five datasets' lists, ranked higher than f. We use raw to denote individual genes. Prior to ranking, we selected 4120 genes (for the raw lists) or gene sets (for the set statistics) to be ranked, so that the number of unique items was identical across all lists.


The potential advantage of gene set signatures over individual-gene signatures depends on the degree of coexpression of the genes within each set. A critical aspect of this performance, therefore, is the source of the grouping of genes into sets. The MSigDB is composed of five set classes, depending on the annotation used to define the sets. Whereas categories C1 and C3 are derived from chromosomal locations and the sequence of regulatory elements, respectively, categories C2 and C4 both originate from pathway information and expression profiles related to cancer; C5 is based on GO categories. In contrast with the other four categories, GO sets do not necessarily define co-expressed or co-regulated genes. Hence, it may not be meaningful to form set statistics over some of these sets. In addition, the datasets these categories are based on vary with respect to sample size; whereas C4 was based on hypothesis-free examination of co-expression across almost two thousand expression profiles, C2 is based mainly on published expression profiles, rarely using more than dozens of samples.

To see whether different MSigDB categories were more useful for predicting metastasis, we combined four datasets (GSE2034, GSE4922, GSE6532, and GSE7390) into a single training set. A separate centroid classifier was trained on each gene set, using the set centroid statistic, and the gene sets were then ranked by the centroid classifier weights (negative to positive). We then tested the classifiers on the remaining dataset, GSE11121. Finally, we used the two-sample Kolmogorov-Smirnov statistic to compare the ranks from the different categories.

Gene Set Enrichment Analysis (GSEA) (Mootha et al., 2003; Subramanian et al., 2005) uses the counting formulation of the two-sided two-sample Kolmogorov-Smirnov statistic (Hollander and Wolfe, 1999, p. 182) to quantify the distribution of genes belonging to some gene set relative to genes not in this set. Note that this statistic does not quantify the deviation from uniform randomness (which would require the one-sample Kolmogorov-Smirnov test), but the deviation of sets from each other. Equivalently, we used the two-sided two-sample Kolmogorov-Smirnov statistic to test for enrichment of sets belonging to a given MSigDB category (C1, C2, C3, C4, C5).

First we define the form of the Kolmogorov-Smirnov statistic used here. Let F(t) and G(t) be the cdfs (cumulative distribution functions) of the two continuous random variables X and Y. The null and alternative hypotheses are, respectively,

H_0: F(t) = G(t) \text{ for all } t, \qquad H_A: F(t) \neq G(t) \text{ for at least one } t. \qquad (4.10)

The two-sided two-sample Kolmogorov-Smirnov statistic is

K = \sup_t |F_n(t) - G_m(t)|, \qquad (4.11)

where F_n(t) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le t) and G_m(t) = \frac{1}{m} \sum_{i=1}^{m} I(y_i \le t) are the empirical cdfs of the two samples (n samples from X and m samples from Y), respectively, and I(\cdot) is the indicator function (1 if the argument is true and 0 otherwise).


4.3. Resultsfunction (1 if the argument is true and 0 otherwise). K is computed asK = maxi=1,...,N ∣F n(z i ) − G m (z i )∣ , (4.12)where z are the combined samples (x 1 , . . . , x n , y 1 , . . . , y m ), that have been ordered in ascendingorder, such that z 1 ≤ z 2 ≤ ⋯ ≤ z N , N = m + n. Our <strong>for</strong>mulation here differs from (Hollanderand Wolfe, 1999, pp. 178–179) in that we do not multiply K by mnd .Under the assumption that X and Y are continuous random variables, there are no tiesbetween F n (t) and G m (t), there<strong>for</strong>e at each t, the difference F n (t)−G m (t) can either increaseby 1/n or decrease by 1/m, but not both. Hence, K can also be calculated using a cumulativesum SK = max ∣S∣ , (4.13)i=1,...,NwhereS j = S j−1 + δ j , δ j =⎧⎪⎨⎪⎩+1/n if z j is from X,−1/m if z j is from Y .j = 1, . . . , N, (4.14)and S 0 = 0.In GSEA, the cumulative sum S is plotted to show the relative location <strong>of</strong> each gene set.Similarly, we plot S to show the location <strong>of</strong> the MSigDB categories in the ranked sets —<strong>for</strong> each category C k , k ∈ {1, 2, 3, 4, 5}, we take X to represent the weights <strong>of</strong> the sets fromC k (weights are averages over the five datasets), and Y to represent the weights <strong>of</strong> the setsoutside the category, that is, C {1,2,3,4,5}∖k . The cumulative sum S k is then computed <strong>for</strong> eachcategory C k .Kolmogorov-Smirnov p-values are conservative (larger) in the presence <strong>of</strong> ties (Pratt andGibbons, 1981, pp. 330–331), hence we do not correct <strong>for</strong> tied ranks. The p-values were basedon the two-sample two-sided test, using ks.test in the R statistical package (R DevelopmentCore Team, 2011).Figure 4.8 shows the cumulative-sum statistic, from which the Kolmogorov-Smirnov statisticis computed, <strong>for</strong> the ranked gene lists. In order to link that list with per<strong>for</strong>mance insample classification, we plotted the centroid classifier’s AUC value <strong>for</strong> each <strong>of</strong> these setsalong the rank (considering only one set at a time). The results show that the C4 sets tendto have more extreme centroid weights, especially towards the negative side, than the othercategories. In contrast, C2 weights show a concentration towards the positive weights, albeitmuch smaller. Category C3 tends to be concentrated in the middle ranks, and category C1tends to be concentrated in the negative to middle ranks. Finally, category C5 is distributedmore uni<strong>for</strong>mly across the ranks.These results show that the highly-predictive sets tend to be C4 sets, and to a lesser extentC2. There is no enrichment <strong>for</strong> C1, C3, and C5 sets, showing that these sets are, as awhole, not useful <strong>for</strong> breast cancer metastasis prediction. For C5, this may be since GO83
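As an illustration of the statistic defined in Eqns. 4.11–4.14 above, the following R sketch (using simulated data, not the thesis datasets; the helper name ks_cumsum is ours) computes the two-sample Kolmogorov-Smirnov statistic via the cumulative-sum formulation and checks it against the built-in ks.test:

# Illustrative sketch (simulated data): the two-sample KS statistic of Eqns. 4.11-4.14,
# computed via the cumulative sum S and checked against base R's ks.test.
ks_cumsum <- function(x, y) {
  n <- length(x)
  m <- length(y)
  z <- c(x, y)
  from_x <- c(rep(TRUE, n), rep(FALSE, m))[order(z)]  # origin of each ordered z_j
  delta <- ifelse(from_x, 1 / n, -1 / m)              # +1/n if from X, -1/m if from Y
  S <- cumsum(delta)                                  # S_j = S_{j-1} + delta_j, S_0 = 0
  max(abs(S))                                         # K = max_j |S_j|
}

set.seed(1)
x <- rnorm(50)
y <- rnorm(70, mean = 0.5)
ks_cumsum(x, y)
unname(ks.test(x, y)$statistic)  # agrees (up to floating point) when there are no ties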


Figure 4.8.: Kolmogorov-Smirnov enrichment for MSigDB categories, using the set-centroid statistic. (A) AUC and spline smooth for each set, tested on GSE11121. (B) Number of mapped probesets in each set, on log2 scale, and spline smooth. (C) Two-sample Kolmogorov-Smirnov Brownian bridge for each MSigDB category (p-values: C1: 1.44×10^−4, C2: 3.55×10^−15, C3: < 2.22×10^−16, C4: 4.22×10^−13, C5: 2.38×10^−2).


For C5, this may be because GO sets do not take the direction of expression change into account, potentially leading to sets composed of genes with a mixture of positive and negative correlations, which may cancel each other out when averaged over. There was no positive enrichment for the C1 and C3 sets, which represent positional gene sets and motif gene sets, respectively; thus there is no evidence that chromosomal aberrations (for C1) or changes in cis-regulation (for C3) are significant drivers of breast cancer metastasis. However, the coverage of these gene sets is probably too limited to conclusively preclude these categories as factors in metastasis.

One possible problem with the set-centroid statistic is that, for small sets, there is a higher probability of observing a spurious extreme statistic, since the variance of the sample mean decreases with set size, and we are considering potentially thousands of such sets. This implies that spurious set centroids (high absolute value) would be more common in smaller sets, leading to a bias towards smaller sets when ranking them. However, there does not seem to be a monotonic relationship between the log-size and rank (Figure 4.9). Additionally, there is reasonable concordance between the sets as independently ranked by the five datasets. We conclude that, while spurious effects due to set size cannot be ruled out, they do not seem to be a major factor in a set's rank. When such effects are of concern, an alternative to the set centroid can be used, such as the set t-statistic, which corrects for differences in set sizes and set variances.

Class | < 5 years | ≥ 5 years | Total
1 ER−/HER2− | 35 | 80 | 115
2 ER+/HER2− | 107 | 423 | 530
3 HER2+ | 55 | 164 | 219

Table 4.4.: Breakdown of samples for each cancer subtype.

4.3.5. Prognostic Gene Sets in Breast Cancer Molecular Subtypes

Breast cancer is a heterogeneous disease, with gene expression segregating the cases into different biologically and clinically relevant subtypes, potentially implying differing biological mechanisms for tumour growth and progression, and suggesting separate cells of origin (Perou et al., 1999; Sotiriou and Piccart, 2007; Sotiriou and Pusztai, 2009). Several molecular subtype classifications have been proposed, the most well known of which is the Stanford "intrinsic" classification (Perou et al., 2000; Sørlie et al., 2001, 2003) (named after the "intrinsic" subset of highly differentially-expressed genes on which the classification is based), which used hierarchical clustering of gene expression data to define five molecular subtypes: basal-like, ERBB2+ (also called HER2+), normal-breast-like, and luminal A and B (some definitions include the luminal C type as well).


Figure 4.9.: AUC and weight versus set size for the set centroid statistic, using the centroid classifier.


These classifications have important clinical implications; for example, HER2+ cases can be treated with trastuzumab (commercially known as Herceptin), whereas ER+ (estrogen-receptor positive) cases should be treated with tamoxifen. The basal-like subtype, mainly defined by a set of intermediate filament genes (keratins) that histologically stain cells with spindle form and that are located further from the milk ducts, represents a collection of cancer cases that are harder to treat. However, there have been concerns about the stability and robustness of subtypes defined from data, especially from the relatively small datasets originally used (Pusztai et al., 2006). In addition, the definition of subtypes does not necessarily correspond to clinical outcomes such as distant metastasis, since these phenotypes were not taken into account in the clustering procedure.

The basal-like subtype largely corresponds to the "triple-negative" breast cancers, so named because they are characterised by being ER−/PR−/HER2− (estrogen receptor negative, progesterone receptor negative, and HER2 negative). Traditionally, ER and PR status has been determined using immunohistochemical assays (protein staining of tissue). Consequently, it has been suggested that a more clinically relevant definition of the subtypes would be based on gene expression (mRNA levels) rather than immunohistochemistry (Desmedt et al., 2008; Loi et al., 2007; Wirapati et al., 2008), directly taking into account information about clinical outcomes, thus defining three molecular subtypes: ER−/HER2−, ER+/HER2−, and HER2+. The ER−/HER2− subtype roughly corresponds to the intrinsic triple-negative class, but excludes PR status, since its overall mRNA level in breast tissue is not high and the role of progesterone-receptor status in defining molecular subtypes is currently unclear. The ER+/HER2− subtype roughly corresponds to the intrinsic luminal A/B classes, and the HER2+ subtype is the same across the two classification systems.

Our results above show a strong cell-cycle signature as highly prognostic of distant metastasis, supporting existing findings (Desmedt et al., 2008). The association of cell-cycle genes with increased risk of metastasis has been mainly attributed to the breast cancer cases that are ER+ (estrogen-receptor positive) (Buyse et al., 2006; Loi et al., 2007), which comprise the majority of the breast cancer population. To classify the samples in our data into their molecular subtypes, we followed the procedure described by Desmedt et al. (2008) and assessed the list of gene modules, which are intended to represent different biological functions such as tumour invasion, immune response, angiogenesis, apoptosis, proliferation, and ER and HER2 signalling. We clustered the samples based on their ER and HER2 module scores (a three-component Gaussian mixture model with diagonal covariance, using the R package flexmix (Leisch, 2004)) into the three molecular subtypes: ER−/HER2−, ER+/HER2−, and HER2+ (Figure 4.10). The number of cases in each subtype is shown in Table 4.4. Subsequently, we reran our analysis, which consists of training the centroid classifier on the MSigDB set statistics, on each subgroup. Table 4.5 shows the top gene sets for each subgroup for the set centroid statistic. The set centroid, set medoid, and set median show enrichment for genes from the AURKA module in the ER+/HER2− subtype, as expected, and to a lesser extent an immune response signature (STAT1 module) in the ER−/HER2− subtype, manifesting as IFN-γ-related sets.
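As a rough illustration of this subtype-assignment step, the following R sketch fits a three-component diagonal-covariance Gaussian mixture to simulated ER/HER2 module scores; it uses the mclust package rather than the flexmix call used in the actual analysis, and the data are invented for illustration only.

# Illustrative sketch with simulated module scores; the thesis's analysis used
# flexmix, but mclust's "VVI" models also give a diagonal-covariance Gaussian mixture.
library(mclust)

set.seed(1)
scores <- data.frame(
  ESR1  = c(rnorm(115, -1.0, 0.4), rnorm(530, 1.0, 0.5), rnorm(219, 0.3, 0.5)),
  ERBB2 = c(rnorm(115,  0.0, 0.4), rnorm(530, 0.0, 0.4), rnorm(219, 1.5, 0.5))
)

fit <- Mclust(scores, G = 3, modelNames = "VVI")  # 3 components, diagonal covariances
fit$parameters$mean        # component means for ESR1 and ERBB2
table(fit$classification)  # cluster sizes; clusters are labelled ER-/HER2-, ER+/HER2-,
                           # HER2+ by inspecting the fitted ESR1 and ERBB2 means above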


Figure 4.10.: Expression of ESR1 (ER) versus ERBB2 (HER2) for the combined dataset. A mixture of three Gaussians is fitted to the data. Clusters 1, 2, and 3 represent the ER−/HER2−, ER+/HER2−, and HER2+ subtypes, respectively.


These results show that the gene sets identified as strongly associated with metastasis can successfully reproduce previously known biological mechanisms of breast cancer metastasis. Additionally, the sets are diverse enough to capture different cellular mechanisms (cell cycle versus immune response) when applied to data with different cancer subtypes. Supplementary Figure A.8 shows the expression of several genes related to breast cancer prognosis and molecular subtype, and the correspondence with the ER/HER2 subtype classification. Apart from the unsurprising association of ESR1 (ER) and ERBB2 (HER2) with the subtypes, GATA3 and FOXA1 especially stand out as genes whose expression is associated with the ER/HER2 subtype, being under-expressed in the ER−/HER2− subtype. Both GATA3 and FOXA1 are known breast cancer prognosis indicators (Albergaria et al., 2009), and suppression of GATA3 has been linked to loss of regulation of tumour differentiation (Dydensborg et al., 2009; Kouros-Mehr et al., 2008).

We also investigated how the prognostic sets derived from different set statistics overlapped with the genes in the modules defined by Desmedt et al. (2008). Figure A.7 shows Kolmogorov-Smirnov plots for each set statistic in each ER/HER2 subtype separately. The sets were sorted in increasing order of centroid classifier weight w. For each subtype, the line in the plots moves up one step of size 1/m (where m is the number of sets containing at least one of Desmedt's genes) when the set contains at least one gene belonging to a module defined by Desmedt et al. (2008), and down (with step size 1/n, where n is the number of sets containing none of these genes) when it does not. Each molecular subgroup uses a different set of Desmedt's module genes: HER2+ is compared against modules PLAU and STAT1, ER−/HER2− against module STAT1, and ER+/HER2− against module AURKA. In contrast to the other set statistics, the set PC, the set t-statistic, and to some extent the set U-statistic exhibit more pronounced enrichment of Desmedt's module genes at the top and bottom of the sorted set list, indicating that the set PC and the set t-statistic are the most concordant with the genes defined by the Desmedt modules. Therefore, the set PC seems to be the best set statistic in terms of reproducing previous module definitions, but it does not perform as well as the other set statistics in terms of predicting metastasis (see Section 4.2.2). The set t-statistic is perhaps a better compromise in terms of predictive ability and agreement with Desmedt's modules.

4.3.6. Do the Gene Sets Point to the Same Biology as the Genes?

We next investigated whether the top gene sets reflect the same underlying biology as the top genes. In the combined data, we trained three types of classifiers: l2-penalised logistic regression (R package penalized (Goeman, 2008)), SVM with linear kernel (R package kernlab (Karatzoglou et al., 2004)), and the centroid classifier. Each classifier was trained on the genes and on the gene set statistics (set centroids), for a total of six models. For each model, we ranked the features by the absolute value of their weights.


Class | # | MSigDB Set | Cat. | Description | Sign
ER−/HER2− | 1 | chr7q12 | C1 | Genes in cytogenetic band chr7q12 | +1
ER−/HER2− | 2 | COLLER MYC DN | C2 | Genes down-regulated by MYC in 293T (transformed fetal renal cell) | −1
ER−/HER2− | 3 | IFNGPATHWAY | C2 | IFN gamma signaling pathway | +1
ER−/HER2− | 4 | GRANDVAUX IFN NOT IRF3 UP | C2 | Genes up-regulated by interferon alpha/beta but not by IRF3 in Jurkat (T cell) | +1
ER−/HER2− | 5 | GNF2 ST13 | C4 | Neighborhood of ST13 | −1
ER−/HER2− | 6 | GNF2 CD48 | C4 | Neighborhood of CD48 | +1
ER−/HER2− | 7 | GNF2 GLTSCR2 | C4 | Neighborhood of GLTSCR2 | −1
ER−/HER2− | 8 | MENSE HYPOXIA DN | C2 | List of hypoxia-suppressed genes found in both astrocytes and HeLa cells | −1
ER−/HER2− | 9 | HSA03010 RIBOSOME | C2 | Genes involved in ribosome | −1
ER−/HER2− | 10 | GCM TPT1 | C4 | Neighborhood of TPT1 | −1
ER+/HER2− | 1 | GNF2 MKI67 | C4 | Neighborhood of MKI67 | −1
ER+/HER2− | 2 | GNF2 TTK | C4 | Neighborhood of TTK | −1
ER+/HER2− | 3 | GNF2 HMMR | C4 | Neighborhood of HMMR | −1
ER+/HER2− | 4 | GNF2 CCNA2 | C4 | Neighborhood of CCNA2 | −1
ER+/HER2− | 5 | GNF2 SMC2L1 | C4 | Neighborhood of SMC2L1 | −1
ER+/HER2− | 6 | GNF2 ESPL1 | C4 | Neighborhood of ESPL1 | −1
ER+/HER2− | 7 | GNF2 CDC20 | C4 | Neighborhood of CDC20 | −1
ER+/HER2− | 8 | GNF2 H2AFX | C4 | Neighborhood of H2AFX | −1
ER+/HER2− | 9 | GNF2 RRM2 | C4 | Neighborhood of RRM2 | −1
ER+/HER2− | 10 | ZHAN MM CD138 PR VS REST | C2 | 50 top-ranked SAM-defined overexpressed genes in each subgroup (PR) | −1
HER2+ | 1 | chr4p | C1 | Genes in cytogenetic band chr4p | −1
HER2+ | 2 | chr1q11 | C1 | Genes in cytogenetic band chr1q11 | +1
HER2+ | 3 | DAC FIBRO DN | C2 | Downregulated by DAC treatment in LD419 fibroblast cells | −1
HER2+ | 4 | GNF2 MKI67 | C4 | Neighborhood of MKI67 | −1
HER2+ | 5 | GNF2 CCNA2 | C4 | Neighborhood of CCNA2 | −1
HER2+ | 6 | GNF2 TTK | C4 | Neighborhood of TTK | −1
HER2+ | 7 | GNF2 H2AFX | C4 | Neighborhood of H2AFX | −1
HER2+ | 8 | GNF2 HMMR | C4 | Neighborhood of HMMR | −1
HER2+ | 9 | CROONQUIST IL6 RAS DN | C2 | Genes downregulated in multiple myeloma cells exposed to the pro-proliferative cytokine IL-6 versus those with N-ras-activating mutations | −1
HER2+ | 10 | CROONQUIST IL6 STARVE UP | C2 | Genes upregulated in multiple myeloma cells exposed to the pro-proliferative cytokine IL-6 versus those that were IL-6-starved | −1

Table 4.5.: Top 10 MSigDB sets for ER/HER2 molecular subtypes, chosen by the centroid classifier using the set centroid statistic. Sign = −1 if expression is negatively associated with long-term survival, and +1 for positive association with long-term survival.


Classifier | # | MSigDB set | p-value | Matches | Set size
CC | 1 | GNF2 MKI67 | < 1.00×10^−40 | 31 | 47
CC | 2 | GNF2 TTK | < 1.00×10^−40 | 29 | 57
CC | 3 | GNF2 CCNA2 | < 1.00×10^−40 | 48 | 99
CC | 4 | GNF2 HMMR | < 1.00×10^−40 | 42 | 78
CC | 5 | GNF2 SMC2L1 | < 1.00×10^−40 | 26 | 51
CC | 6 | GNF2 CDC20 | < 1.00×10^−40 | 46 | 91
CC | 7 | GNF2 ESPL1 | < 1.00×10^−40 | 27 | 58
CC | 8 | GNF2 H2AFX | < 1.00×10^−40 | 24 | 54
CC | 9 | GNF2 RRM2 | < 1.00×10^−40 | 32 | 68
CC | 10 | chr1q11 | 2.32×10^−6 | 2 | 4
SVM | 1 | chr7q12 | 6.23×10^−4 | 1 | 1
SVM | 2 | chr3q11 | 1.00 | 0 | 8
SVM | 3 | chrxq | 1.00 | 0 | 2
SVM | 4 | BYSTRYKH RUNX1 TARGETS GLOCUS | 8.06×10^−3 | 1 | 13
SVM | 5 | TESTIS EXPRESSED GENES | 7.28×10^−7 | 4 | 107
SVM | 6 | chr22q | 1.00 | 0 | 6
SVM | 7 | REGULATION OF G PROTEIN COUPLED RECEPTOR PROTEIN SIGNALING PATHWAY | 4.28×10^−4 | 2 | 48
SVM | 8 | chr11p14 | 1.00 | 0 | 20
SVM | 9 | TERCPATHWAY | 1.00 | 0 | 15
SVM | 10 | chr1q41 | 2.02×10^−4 | 2 | 33
LR | 1 | chr3q11 | 1.00 | 0 | 8
LR | 2 | chr22q | 1.00 | 0 | 6
LR | 3 | TERCPATHWAY | 1.00 | 0 | 15
LR | 4 | chrxq | 1.00 | 0 | 2
LR | 5 | BYSTRYKH RUNX1 TARGETS GLOCUS | 8.06×10^−3 | 1 | 13
LR | 6 | HSA00130 UBIQUINONE BIOSYNTHESIS | 1.00 | 0 | 8
LR | 7 | chr20p | 1.00 | 0 | 2
LR | 8 | chr1q41 | 1.29×10^−6 | 3 | 33
LR | 9 | chr3q12 | 1.00 | 0 | 23
LR | 10 | BETA TUBULIN BINDING | 1.00 | 0 | 12

Table 4.6.: Top 10 sets using the set centroid statistic with different classifiers, and the p-value for the size of the intersection between the top individual genes and the top gene sets (Fisher's exact test, one-sided). CC is the centroid classifier, LR is logistic regression.


We then selected the top 512 genes, a number high enough to produce a high AUC (beyond which the AUC does not increase much) and much higher than the number of genes in many published metastatic signatures. Other cutoffs (256, 1024, 2048) exhibited similar results (not shown). For each of the top-ranked sets, we then checked how many of the top-ranked genes belonged to that set, using the same classifier (that is, centroid genes to centroid sets, logistic regression genes to logistic regression sets, SVM genes to SVM sets). The number of individual genes that mapped to each set was quantified using a one-sided Fisher exact test, in order to check whether the size of the intersection between the top sets and the top individual genes was significantly larger than expected by chance.

As shown in Table 4.6, there is significant overlap between the top sets and the top genes found by the centroid classifier. In comparison, both logistic regression and the SVM show very little overlap. In other words, the top genes selected by the centroid classifier are over-represented in the top sets ranked by the centroid classifier using the set centroid statistic, indicating the same underlying biological processes associated with metastasis. This does not seem to be the case for the other classifiers — the top sets found by them tend not to contain the top genes identified when considering individual genes. While this phenomenon is not necessarily to be taken as a shortcoming of models such as logistic regression, it serves as further confirmation that, for the centroid classifier at least, the biology identified by the gene sets seems to be similar to that identified by the individual genes. Further work is needed to evaluate the underlying reasons for the differences between the models. However, for the centroid classifier, the top sets are consistent with signatures for cancer progression, as discussed in Section 4.3.4.

4.4. Summary

While the understanding of breast cancer etiology and prognosis has progressed using gene expression data, one of the main challenges has been robust identification of which genes are highly associated with metastasis, with different studies producing gene lists with little or no overlap, raising doubts about the biological interpretation of these genes and the robustness of the results.

We have shown that classifiers based on sets of genes, rather than individual genes, have similar predictive power but are more stable and more reproducible, both within datasets and between datasets, and as a result may facilitate increased understanding of the biological mechanisms relating to breast cancer prognosis. The likely explanation is that the expression of any given gene is a function of both its contextual regulation — regulation under varying conditions, both observed and unobserved (such as noisy transcription) — and inherent variability due to germ-line variations and differences in host-tumour response between individuals (Morley et al., 2004). The former variability can be used for prognostic purposes.


However, the latter reduces the prognostic accuracy, since patient-level variability is typically not considered when building prognostic models. The lack of predictive improvement from using gene sets has recently been observed by Staiger et al. (2011), with possible reasons including high levels of noise in the data, the simplicity of the set assignment methods (pathway information is usually neither detailed nor directional), and the crudeness of the set statistics in extracting meaningful signal from the sets' expression. Furthermore, Staiger et al. (2011) and Venet et al. (2011) have shown that sets composed of randomly chosen genes perform as well as those based on MSigDB, at least in regards to predictive ability. However, as Venet et al. (2011) demonstrate, more than 50% of the genes assayed in most human microarray experiments show some correlation with the cell-cycle process, which is known to be associated with the outcome. Therefore, even gene sets composed of randomly selected genes are likely to contain some genes associated with the cell cycle and therefore indirectly with the outcome. The chance of including these predictive genes should increase as the set size increases. In contrast, we did not find evidence for an association between the set size and the predictive ability of the set. Clearly, future work should include investigating the top predictive gene sets and dissecting their constituent genes to better understand which genes in a set confer the predictive ability and which are potentially superfluous.

We have found that the C4 computationally-derived sets tended to produce better classifiers of metastasis than sets from the other MSigDB categories. This difference may be due to the fact that C4 sets are sets of coexpressed genes based on datasets with a large number of cancer samples, and were designed to be associated with the cancer phenotype. These results suggest that there is more prognostic value in large-scale systematic efforts to compile lists of sets of coexpressed genes from large datasets (Brentani et al., 2003; Segal et al., 2003), as opposed to approaches that build sets from limited pathway and GO knowledge. In order to be useful for phenotypes other than breast cancer metastasis, and to potentially increase statistical power, these datasets should cover a wide range of diseases and phenotypes.

Importantly, our results are in agreement with the current understanding of the main drivers of breast cancer metastasis, namely proliferation for ER+/HER2−, immune response for ER−/HER2−, and tumour invasion and immune response for HER2+ (Desmedt et al., 2008), suggesting that the stability advantages afforded by gene sets do not come at the cost of biological interpretation. Apart from patient prognosis, there is also potential to apply the same approach to other phenotypes, such as understanding the biological mechanisms responsible for response and resistance to anti-cancer therapies (Li et al., 2011).

We have used simple set statistics to represent gene set activity. These statistics are computationally tractable and depend on predefined set memberships. Some set statistics are not always sensible; for example, the average expression of a gene set may not be meaningful when the genes are negatively correlated or uncorrelated; different statistics may be optimal for different gene sets.


Moreover, these statistics ignore the structure and temporal dynamics of the gene networks, which could be important in deciphering causal relationships between genes and phenotypes. However, reliable information about the detailed structure of human gene networks is currently limited, relative to pathways in simpler organisms such as yeast, which have been well characterised.

With respect to predictive ability for metastasis, we and others (Chuang et al., 2007; Kim and Kim, 2008; Staiger et al., 2011; van 't Veer et al., 2002) have applied a wide range of machine learning methods to breast cancer data, including those based on individual genes and gene sets, and have found similar results. While future improvement from examining even larger datasets cannot be ruled out, there is strong evidence to suggest that the upper limit has been reached. Predictive ability is hindered by variance, resulting from several factors. The first factor is microarray measurement noise. Noise may be mitigated by technical replicates (several microarrays per patient sample), larger sample sizes, and, ultimately, replacement of gene expression microarrays with other technologies such as RNA-Seq (Shendure, 2008) when those are mature and cheap enough. A second factor is variability due to unmeasured (latent) factors that vary between patients and within patients across time — gene expression microarrays do not take into account many other biochemical factors in the cell, such as proteins and other metabolites, and epigenetic effects. The molecular classification of breast cancer may be further refined in the future, based on these new forms of genomic data. If successfully combined, these additional forms of information may increase predictive ability. A third factor limiting predictive ability is the inherent stochasticity of the metastatic process, as with all biological processes — it may be that even given "perfect" information about the biological status of each patient, we may not be able to accurately predict their metastatic outcome several years into the future, as it may essentially be a highly random event.

Future Work

There are other ways in which these data could have been analysed. First, survival models such as Cox proportional-hazards models could be used to more fully take into account the time-to-metastasis data, rather than arbitrarily discretising the outcomes into a binary variable. Second, other approaches take the gene set information into account without having to aggregate features into gene set statistics. One approach could be based on group lasso models (Jacob et al., 2009; Meier et al., 2008), where a lasso penalty is applied to the l2-norm of each gene set. Such a penalty encourages selection of genes in a set, such that if one gene is selected to be in the model then other genes in the set can enter the model as well. The group lasso approach is more flexible than the set statistic approach since it still operates at the individual gene level, rather than operating on sets. Similarly to the set statistics, the group lasso still requires a definition of which genes belong to which sets. Another alternative approach could be a hierarchical model, either frequentist mixed-effects models or Bayesian hierarchical models (Gelman and Hill, 2007).


In both types of hierarchical model, genes can be analysed individually, but in addition set effects (such as each set having its own average expression level) and other effects of interest, such as molecular subtype classification and age, can be accounted for. Again, such models potentially offer more flexible modelling of the data than set statistics. The downside of such models is that they are much more computationally demanding, which is an important consideration when analysing genomic datasets that commonly have upwards of thousands of features. Third, instead of using fixed predefined sets, de novo set discovery could be applied to these data, where the sets are defined from the data on the basis of coexpression, using methods such as hierarchical clustering, Gaussian mixture models, or latent Dirichlet allocation (LDA) (Blei et al., 2003) and related approaches (Liu et al., 2010; Savage et al., 2010). De novo discovery would likely need a large number of samples in order to produce stable and reproducible results. Such an approach would be easier now that many gene expression datasets are publicly available through repositories such as NCBI GEO (http://www.ncbi.nlm.nih.gov/geo) and ArrayExpress (http://www.ebi.ac.uk/arrayexpress), as long as the issues of suitable dataset normalisation and integration across different platforms are taken into account. Fourth, univariate summaries of gene expression levels in a set, such as the first principal component, may be incapable of capturing a substantial amount of the variation in the set, requiring multiple principal components. In such a case, a multivariate model or a canonical correlation analysis (CCA) approach may prove more useful. The definition of breast cancer subtypes may change as well, as larger and more comprehensive datasets are accumulated, leading to more stable and subtle disease classes, as demonstrated recently by Curtis et al. (2012).


5. Fast and Memory-Efficient Sparse Linear Models of Large Genome-Wide Datasets

5.1. Introduction

One of the challenges raised by recent advances in the genomics of complex phenotypes is the prediction of phenotype given genotype, such as prediction of disease from SNP data. Successful identification of SNPs strongly predictive of disease promises a better understanding of the biological mechanisms underlying the disease, and has the potential to lead to early disease diagnosis and preventative strategies. The question of predictive ability is also closely related to the proportion of phenotypic and genetic variance that can be explained by common SNPs, and to the lively debate surrounding the "missing heritability" of many complex diseases (Manolio et al., 2009). To quantify the genetic effect, we must fit a statistical model to all SNPs simultaneously. Lasso-penalised models (Tibshirani, 1996) are well suited to this task, since they perform variable selection — some model weights are exactly zero and thus excluded from the model. In this way, lasso models remove the need for screening SNPs based on univariable statistics prior to fitting a multivariable model of the phenotype (Wu et al., 2009).

However, fitting models to genome-wide or whole-genome data is challenging, since such studies typically assay thousands to tens of thousands of samples and hundreds of thousands to millions of SNPs. With standard analysis tools, modelling genome-wide and whole-genome data is either impossible or extremely inefficient. For example, most existing analysis tools require loading the entire dataset into memory prior to fitting the models, which is both time-consuming and requires large amounts of memory to store the data and fit the models. In order to perform simultaneous modelling of SNP variation across the genome and build predictive models of disease and phenotype, it is clear that there is a need for new tools that are fast, not memory intensive, and easy to use.

Here, we present the tool SparSNP, which is an efficient implementation of lasso-penalised linear models. SparSNP can fit lasso models to large-scale genomic datasets in minutes using small amounts of memory, outperforming equivalent in-memory methods. Thus, SparSNP makes it practical to analyse massive datasets without the use of specialised computing hardware or cloud computing. SparSNP produces cross-validated model weights that can be used to select the top predictive SNPs. SparSNP also allows the resulting models to be evaluated for predictive power, and for phenotypic and genetic variance explained.

5.2. Background

SparSNP is an efficient implementation of l1-penalised loss minimisation with linear and squared-hinge loss functions, which we now discuss in more detail.

5.2.1. Penalised Loss

As introduced in Chapter 3, statistical models are fit by minimising a suitable loss function, such as linear loss for linear regression, logistic loss for logistic regression, and hinge loss for classification. Any of these loss functions can be penalised with an l1 (lasso) penalty and minimised to find the solutions (β_0^*, β^*) as follows:

\[ (\beta_0^*, \beta^*) = \operatorname*{arg\,min}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} L'(\beta_0, \beta) = L(\beta_0, \beta) + \lambda \sum_{j=1}^{p} |\beta_j|. \tag{5.1} \]

The penalty λ ≥ 0 is user-specified and controls the degree of penalisation. A high l1 penalty encourages sparse solutions (many β̂_j exactly zero for high enough λ). In contrast, another common penalty, the l2 penalty, defined as ||β||_2^2 = ∑_{j=1}^{p} β_j^2, induces proportional shrinkage of the estimates but generally does not induce sparse solutions (Hastie et al., 2009a). Note that the intercept term β_0 is not penalised, to prevent the estimation from depending on the chosen origin of the response y (Hastie et al., 2009a). In practical terms, the lasso penalty combines model fitting with variable selection, whereas the ridge penalty does not perform variable (feature) selection, requiring additional steps to select variables, such as discarding variables with low absolute weight.
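This lasso/ridge contrast is easy to see with an in-memory solver on simulated data; the sketch below uses the glmnet R package (which is not the SparSNP implementation described in this chapter) to fit both penalties and report how many coefficients are exactly zero along the regularisation path.

# Illustrative sketch on simulated data: l1 (lasso) versus l2 (ridge) penalised
# linear regression. glmnet is an in-memory solver; SparSNP (below) addresses
# the case where the data do not fit in memory.
library(glmnet)

set.seed(1)
N <- 200; p <- 1000
X <- matrix(rnorm(N * p), N, p)
beta <- c(rep(1, 5), rep(0, p - 5))   # only the first 5 variables have an effect
y <- as.vector(X %*% beta + rnorm(N))

lasso <- glmnet(X, y, alpha = 1)      # alpha = 1: pure l1 penalty
ridge <- glmnet(X, y, alpha = 0)      # alpha = 0: pure l2 penalty

lasso$df                              # non-zero weights at each lambda: sparse
ridge$df                              # ridge keeps essentially all p weights non-zero

In practice, λ would be chosen by cross-validation (for example with cv.glmnet), which parallels the cross-validated model weights produced by SparSNP.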


The lasso can also be formulated in a constrained form (Tibshirani, 1996),

\[ (\beta_0^*, \beta^*) = \operatorname*{arg\,min}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} L(\beta_0, \beta) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s, \tag{5.2} \]

where s ≥ 0. The Lagrange form (Eqn. 5.1) and the constrained form (Eqn. 5.2) are equivalent in the sense that for each s there is a penalty λ that yields the same solutions (Hastie et al., 2009a).

Both the l1 and the l2 penalties are useful in genome-wide analysis. First, in most studies N ≪ p, where N is the number of samples and p is the number of SNPs; therefore the standard linear and logistic solutions are not mathematically well-defined unless penalised in some way. Second, the l1 penalty is useful since we expect only a small fraction of all SNPs to be truly causal, rather than spuriously associated with the phenotype through linkage disequilibrium with the causal SNP. The penalty allows the exact degree of sparsity to be tuned, so different models with different degrees of sparsity can be explored. Third, the l1 penalty shrinks the estimated coefficients towards zero, thereby reducing the tendency of the models to overfit — reducing the model's generalisation error (expected error on previously unseen datasets). When inputs are highly correlated, the l2 penalty tends to assign weights with similar magnitude but opposite signs to correlated inputs, whereas the l1 penalty tends to induce models that select one variable out of a group of correlated variables. These differences should be taken into account when interpreting the selected variables, especially for SNPs, as lack of inclusion in the l1-penalised model does not necessarily imply lack of association with the phenotype — the excluded SNP may have been "masked" by a highly correlated SNP that is in the model.

5.2.2. Review of Methods for Fitting Lasso Models

Given a convex loss function, such as the linear regression, logistic regression, and squared-hinge loss functions, the l1-penalised loss (Eqn. 5.1) is convex as well (with respect to the weights β). Convex optimisation is a well-understood problem, for which many tools are available (Boyd and Vandenberghe, 2004; Nocedal and Wright, 2006), each with their own strengths and weaknesses. We briefly review some of the major classes of methods for fitting l1-penalised models; see Bach et al. (2011) for a detailed discussion of these approaches and others.

• LAR and Homotopy. In the LAR (Least Angle Regression) algorithm (Efron et al., 2004), assuming standardised (zero mean and unit variance) inputs and the residual vector r_i = y_i − x_i^T β, the weight β_j for the variable x_j = (x_{1j}, ..., x_{Nj})^T most correlated with r = (r_1, ..., r_N)^T is increased towards its unpenalised least-squares weight x_j^T r until another variable x_k has at least equal correlation. Then x_k is entered into the model and the residuals are recomputed. This process is repeated until all variables are in the model, achieving the unpenalised least-squares solution. Modifying LAR to exclude a non-zero variable that becomes zero, and then recomputing the residuals, yields the lasso solution, and the series of LAR solutions is then the lasso regularisation path (the piecewise-linear series of weights for each λ). The homotopy method (Osborne et al., 2000a,b) is similar to LAR (Hastie et al., 2009a).

• Coordinate descent. Also called Shooting (Fu, 1998), later rediscovered and expanded by Daubechies et al. (2004) and Friedman et al. (2007). In coordinate descent, the loss is optimised with respect to each variable separately, while holding the others fixed. This process is repeated over all variables cyclically (in which case it is equivalent to the Gauss-Seidel method), randomly (Shalev-Shwartz and Tewari, 2009), or in some other order (for example, by the magnitude of the gradients), until convergence. The l1 penalty is applied to the estimates using the soft-thresholding operation. Coordinate descent can be parallelised (Bradley et al., 2011). In addition, block coordinate descent can be used, where minimisation is performed with respect to blocks of variables rather than one variable at a time. We discuss cyclical coordinate descent in more detail in Section 5.4.1.

• Projected gradient. In projected gradient methods (Duchi et al., 2008), the loss function is minimised in the original constrained form (Eqn. 5.2) rather than the Lagrangian form. The loss minimisation is performed using gradient descent, and the constraints are imposed by projecting the weights onto the feasible region where the constraints are satisfied (the l1 ball).

• Stochastic gradient descent. Stochastic gradient descent (SGD) (Bottou and LeCun, 2004; Langford et al., 2009) is a first-order online algorithm, in which a gradient descent step β_(k+1) = β_k − η ∇L(β_k), for some small step size η, over all the input variables is taken after each sample is encountered; this is in contrast to the other methods, which are batch methods, where each update is based on all samples. The step size is usually decreased over time (step size decay). Sparsity can be obtained by truncating updates that cross zero (a simplified version of this idea is sketched after this list). SGD has the advantage of requiring very little memory, since only one sample needs to be accessed at any one time. SGD has been shown to achieve good generalisation error very quickly, since in practice an algorithm does not need to find the true global minimum of the loss function in order to have good out-of-sample performance. However, SGD requires careful tuning of the step size η and the step-size decay scheme to achieve good results, which may limit its widespread adoption within the genomics community, compared with other methods that require less tuning.
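To make the truncation idea concrete, the following R sketch (a deliberate simplification, not the truncated-gradient algorithm of Langford et al. (2009) and not part of SparSNP) performs one SGD epoch for the l1-penalised linear loss, zeroing any weight whose update crosses zero:

# Simplified illustration of SGD with zero-crossing truncation for the
# l1-penalised linear (least-squares) loss; eta and lambda are held fixed here,
# whereas in practice the step size would be decayed over time.
sgd_lasso_epoch <- function(X, y, beta, eta, lambda) {
  for (i in sample(nrow(X))) {                 # visit samples in random order
    r <- sum(X[i, ] * beta) - y[i]             # residual for sample i
    g <- r * X[i, ] + lambda * sign(beta)      # subgradient of the penalised loss
    beta_new <- beta - eta * g                 # gradient step
    crossed <- beta != 0 & sign(beta_new) != sign(beta)
    beta_new[crossed] <- 0                     # truncate updates that cross zero
    beta <- beta_new
  }
  beta
}

set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20)
y <- X[, 1] - X[, 2] + rnorm(100)
beta <- rep(0, 20)
for (epoch in 1:20) beta <- sgd_lasso_epoch(X, y, beta, eta = 0.01, lambda = 0.1)
sum(beta == 0)   # count of weights that are exactly zero after the final epoch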


5.3. Design Considerations

Having reviewed several methods for fitting lasso models, we now discuss considerations for designing a practical method for fitting lasso models in the context of large-scale genetic data.

Speed. Genetic datasets often contain hundreds of thousands of markers and thousands of samples. We would like to use methods that can process large amounts of data rapidly. Data are usually not analysed once: we typically perform cross-validation, examine different phenotypes (if available), fit models to subsets of the data, and so on. Therefore, it is not realistic to use a method that requires many hours or entire days to fit one model. When analysing large datasets, the bottleneck quickly becomes I/O (reading data from disk) rather than fitting the model. Therefore, not having to load all data into memory before fitting begins is an advantage. In addition, speed concerns dictate the implementation language: an interpreted language such as R or Python may be more convenient than a compiled language such as C, but interpreted languages are typically one or two orders of magnitude slower than C, with less fine-grained control over operations such as copying of data structures, which can be prohibitive for large data. The use of C also permits more compact data representations.

Scalability. We define scalability as the ability to analyse increasingly large datasets with constant or linearly-increasing resources. Scalability is not the same as speed. For example, any method that depends on computing covariance matrices, such as Newton's method for minimisation, is not naïvely scalable to SNP data, as storing these matrices, let alone performing operations on them, is beyond the abilities of current commodity hardware; for example, storing the triangular covariance matrix of 500,000 SNPs with one byte per entry would require around 116GiB of RAM (covariance matrices are typically not sparse, so sparse matrix representations are not useful here). Even for methods that do not utilise covariance matrices, such as coordinate descent and quasi-Newton methods (variants of Newton's method using approximations of the Hessian matrix), loading all data into memory may not be practical. Therefore, when analysing large datasets, we would like to avoid having to load all data into memory at once. In addition, tools that make as few copies of the data as possible are preferable, both for increased speed and for reduced memory usage.

Tuning. All l1-penalised methods require tuning of the λ penalty. However, some methods require tuning additional parameters to achieve good results. For example, stochastic gradient descent requires manual tuning of the step size and its decay rate. We would like to free the analyst from concerns about the numerical properties of the fitting algorithm, such as whether convergence has occurred or not, and let them concentrate on analysing the data.


While no method is completely fail-safe under all possible inputs, we and others (Friedman et al., 2010) have empirically found coordinate descent to be numerically stable when analysing SNP data, consistently converging over all useful penalty ranges, without any other tuning required.

5.4. Methods

Based on the design considerations above, we have designed an efficient implementation of cyclical coordinate descent for fitting l1-penalised linear models to SNP data, requiring memory that grows only linearly, O(N + p), and outperforming several state-of-the-art in-memory methods (and one out-of-core method) when accounting for the time taken to load the data into memory. We now describe in more detail our implementation of coordinate descent in SparSNP.

5.4.1. Out-of-Core Coordinate Descent

Minimising l1-penalised loss functions is a convex optimisation problem. However, it has, in general, no analytical solution, and must be solved numerically. We use a method based on coordinate descent to numerically minimise the loss function. By expressing the contribution of each variable to each sample in terms of a sum over the p variables (the linear predictor), we can use memory of order O(N + p), keeping only one input variable in working memory at a time by reading data from disk and updating the estimates at the end of each epoch. Pseudocode for the algorithm is shown in Algorithm 1.

In coordinate descent (Friedman et al., 2007, 2010; Van der Kooij, 2007), each variable is optimised with respect to the loss function using a univariable Newton step, while holding the other variables fixed. Since the updates are univariable, computation of the first and second derivatives is fast and simple (we assume that all of our loss functions are twice-differentiable, at least piecewise). The l1/l2 penalisation is achieved using soft thresholding (Friedman et al., 2007) of the Newton step for each variable β_j,

\hat{\beta}_j \leftarrow S(\hat{\beta}_j - s_j, \lambda),    (5.3)

where s_j = \frac{\partial L}{\partial \beta_j} / \frac{\partial^2 L}{\partial \beta_j^2} is the Newton step and S(⋅, ⋅) is the soft-thresholding operator

S(\alpha, \gamma) = \operatorname{sign}(\alpha) \max\{0, |\alpha| - \gamma\}, \quad \gamma \ge 0.

For the linear loss and the squared hinge loss, one Newton step yields the exact solution with respect to each β_j (though it is not necessarily the same as the global solution, hence the need for the active-set convergence method, see below). Other losses, such as the logistic loss, can be handled by a quadratic approximation (second-order Taylor expansion), in which case iteration over each β_j may be required until convergence.
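As a concrete illustration of Eqn. 5.3, the following minimal Python sketch (illustrative only, not the SparSNP C code; all names are hypothetical) implements the soft-thresholding operator and the thresholded Newton update for one coordinate. Here grad and hess stand for the loss derivatives ∂L/∂β_j and ∂²L/∂β_j², whose forms for the losses used here are given below.

def soft_threshold(alpha, gamma):
    # S(alpha, gamma) = sign(alpha) * max(0, |alpha| - gamma), gamma >= 0
    sign = 1.0 if alpha >= 0 else -1.0
    return sign * max(0.0, abs(alpha) - gamma)

def coordinate_update(beta_j, grad, hess, lam):
    # Thresholded univariable Newton step (Eqn. 5.3); the intercept beta_0
    # would be updated with the same step but without the penalty.
    s_j = grad / hess
    return soft_threshold(beta_j - s_j, lam)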


There are two key aspects of this approach that allow efficient computation without keeping all data in working memory. First, since we are performing univariable minimisation, both the first and second derivatives are scalars and are computed in a single pass over the samples. Second, the partial derivative with respect to β̂_j is computed efficiently since it is based on the linear predictor l, which is the sum of the contributions of all variables to the model, l_i = β̂_0 + \sum_{j=1}^{p} x_{ij} β̂_j, i = 1, ..., N. The linear predictor changes only for one variable at a time, and only if that variable changes its value. Once the estimate for the jth variable has been updated, the linear predictor is updated by subtracting the old contribution x_{ij} β̂_j and adding the new contribution x_{ij} β̂'_j. The linear predictor form can be used whenever a linear or log-linear statistical model is used, since the predictor is then additive in each variable's contribution. By only storing the linear predictor and one vector of samples x_{1j}, ..., x_{Nj} in memory at any given time, we can keep memory requirements to a minimum, allowing us to fit models to datasets far larger than available RAM.

The coordinate descent algorithm is identical across all loss functions; the only difference is the computation of the Newton step and the update to the linear predictor. For the linear loss, the first derivative with respect to the weight β_j is

\frac{\partial L}{\partial \beta_j} = \sum_{i=1}^{N} x_{ij} (l_i - y_i),    (5.4)

and the second derivative is

\frac{\partial^2 L}{\partial \beta_j^2} = \sum_{i=1}^{N} x_{ij}^2.    (5.5)

For the squared hinge loss, the first derivative is

\frac{\partial L}{\partial \beta_j} = \sum_{i=1}^{N} y_i x_{ij} (y_i l_i - 1)\, I(1 - y_i l_i > 0),    (5.6)

and the second derivative is

\frac{\partial^2 L}{\partial \beta_j^2} = \sum_{i=1}^{N} x_{ij}^2\, I(1 - y_i l_i > 0),    (5.7)

where I(⋅) is the indicator function, which evaluates to one if its argument is true and to zero otherwise. Monomorphic (zero variance) SNPs are assigned zero weight since their first and second derivatives are both zero. When the input data are standardised such that each SNP has zero mean and unit variance,

x'_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j},

where \bar{x}_j and σ_j are the arithmetic mean and standard deviation of the jth genotype, respectively, then the second derivative satisfies ∂²L/∂β_j² ≤ N − 1 for the linear and squared hinge losses (due to the I(⋅) term), with strict equality for the linear loss. Therefore, the Newton step can be computed as s_j = (∂L/∂β_j)/(N − 1), without explicitly computing the second derivative.
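A hedged NumPy sketch of the per-SNP derivative computations (Eqns. 5.4-5.7) and the incremental linear-predictor update described above; variable names are illustrative, and the phenotype y is assumed coded as ±1 for the squared hinge loss.

import numpy as np

def derivatives_linear(x_j, y, lp):
    # Eqns. 5.4 and 5.5: squared-error (linear) loss
    grad = np.sum(x_j * (lp - y))
    hess = np.sum(x_j ** 2)
    return grad, hess

def derivatives_sqrhinge(x_j, y, lp):
    # Eqns. 5.6 and 5.7: squared hinge loss, y in {-1, +1}
    active = (1.0 - y * lp) > 0                    # I(1 - y_i * l_i > 0)
    grad = np.sum(y * x_j * (y * lp - 1.0) * active)
    hess = np.sum((x_j ** 2) * active)
    return grad, hess

def update_linear_predictor(lp, x_j, old_beta, new_beta):
    # Only the jth variable's contribution changes: subtract the old
    # contribution and add the new one, in one pass over the N samples.
    return lp + (new_beta - old_beta) * x_j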


5.4.2. Computational Enhancements

SparSNP employs several enhancements to the basic coordinate descent method that greatly improve performance without affecting the model fit. For large datasets, the main bottleneck is I/O (reading the data into memory), not fitting the model itself. Therefore, the most significant speed improvements come from reducing the number of times SNPs are loaded from disk and reducing the time taken to get the data into a usable form that can be fed into the model-fitting procedure.

Active-set convergence. The active-set method (Friedman et al., 2007) is designed to take advantage of the sparsity of the weight vector β, as is commonly the case in analyses of SNP data, where only a small fraction of the SNPs are expected to have non-zero weights. The method has two main stages. First, we iterate over all variables, one at a time. If any variable j becomes zero (inactive) due to the soft-thresholding, it is excluded from the next iteration. We then iterate over the remaining active variables. When the loss converges (Section 5.4.3), we check whether the active set has changed. If the active set does not change in two such consecutive iterations, the algorithm terminates. Otherwise, all variables are added back to the active set and iterated over as before.

Warm restarts. We use a warm-restart strategy (Friedman et al., 2010) whereby we run coordinate descent over a grid of penalties λ_max, ..., λ_min. We define the maximal penalty λ_max as the smallest λ that makes all β̂_j = 0; it is computed by first computing the unpenalised intercept, and then evaluating the Newton step (Eqn. 5.3) for each variable j. Each weight β̂_j is initialised to zero. Due to soft thresholding (Eqn. 5.3), each β̂_j will remain zero if its step satisfies |s_j| ≤ λ. Therefore, the maximal λ is max_{j=1,...,p} |s_j|. The minimal penalty λ_min is taken to be a small fraction of the maximal λ, usually 10^{-2} λ_max. The process then proceeds along the grid: the results from the fit with penalty λ_k (including the vector of solutions β̂, the linear predictor l, and the active set) are used to initialise the algorithm for the fit with λ_{k+1}. This strategy reduces computation time considerably, since the (k + 1)th fit can typically start from a small active set, rather than from the entire set of variables.
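A small NumPy sketch of how such a penalty grid might be constructed for the linear loss, under the assumptions that the genotypes are already standardised (so the second derivative equals N − 1) and that the unpenalised intercept equals the phenotype mean; the function name, the grid size, and the ratio are illustrative only.

import numpy as np

def lambda_grid(X, y, n_penalties=20, ratio=1e-2):
    # X: N x p matrix of standardised genotypes, y: N-vector phenotype.
    lp = np.full(y.shape, y.mean())            # intercept-only linear predictor
    steps = X.T @ (lp - y) / (len(y) - 1)      # Newton steps s_j with all beta_j = 0
    lam_max = np.max(np.abs(steps))            # smallest lambda keeping all beta_j = 0
    lam_min = ratio * lam_max
    return np.geomspace(lam_max, lam_min, n_penalties)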


Caching. The active set is typically much smaller than the total number of SNPs, and tends to be accessed more often than the other variables. Therefore, it is useful to keep the active set in memory rather than repeatedly read it from disk, which is orders of magnitude slower: in addition to random access of disk being slower than random access of RAM, the data are byte-packed and must be unpacked before use. However, all SNPs need to be accessed at some stage during the active-set convergence process. Therefore, we employ a simple priority cache of predetermined size. If there is room in the cache, more SNPs are read in until it is full. Once full, we read SNPs into the cache, replacing previous SNPs only if the new SNP has been accessed more often in previous iterations (we keep an access counter for each SNP). This way, the active-set SNPs (and other often-accessed SNPs, if there is room) tend to stay in the cache whereas the other SNPs do not, accelerating the active-set method.

Since caching more SNPs reduces disk accesses, the performance of SparSNP depends critically on the amount of RAM allocated to the cache. Performance will increase with increasing cache size, up to the point where the entire dataset is in RAM.

Pre-scaling. Inputs to l1-penalised models are typically standardised such that each genotype has zero mean and unit variance, since each input variable may be on a different scale, in which case using the same penalty for all variables may not make sense. (In the context of SNP data, scaling the SNPs corresponds to giving more weight to rarer variants.)

Pre-scaling is simply a time-saving measure that standardises the SNPs as a preprocessing step: the coordinate descent method repeatedly iterates over the variables, and repeatedly scaling the same inputs is wasteful. Therefore, we scale the data as a pre-processing step and store the means and standard deviations in a file. These parameters are later loaded during the actual fitting stage, and the precomputed values for each SNP are fetched from a lookup table instead of the raw {0, 1, 2} genotypes. When the fitting is complete, we transform the model weights estimated on the standardised inputs back to their original scale,

\hat{\beta}_j = \hat{\beta}_j^* / \sigma_j,    (5.8)

and

\hat{\beta}_0 = \hat{\beta}_0^* - \sum_{j=1}^{p} \hat{\beta}_j^* \mu_j / \sigma_j,    (5.9)

where β̂*_0 and β̂*_j are the intercept and weights estimated on the standardised inputs, and μ_j and σ_j are the mean and standard deviation of the jth SNP.

Data representation. We represent the genotypes in minor-allele dosage form {0, 1, 2, NA}, where "NA" denotes missing genotypes. We use the same byte-packing scheme as PLINK, encoding 4 genotypes in one byte, greatly reducing space requirements compared with the 8 bytes per genotype that would be required for a double-precision floating-point representation (the default for most numerical software). Besides space savings, byte packing leads to faster I/O, which is the main bottleneck of our method. Note that for the fitting stage, data are used in their scaled form, which is in double-precision floating-point format, and must be unpacked prior to use. Fast unpacking is achieved using a pre-computed lookup table that maps bytes (interpreted as unsigned integers in the range [0, 255]) to groups of four genotypes at a time.
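The following Python sketch illustrates the lookup-table idea with a hypothetical 2-bit-per-genotype packing; the particular bit assignments used here are an assumption for illustration and are not claimed to match PLINK's actual BED encoding.

import numpy as np

# Hypothetical 2-bit codes: dosages 0, 1, 2, plus 3 for missing ("NA").
CODES = np.array([0, 1, 2, -1], dtype=np.int8)    # -1 stands in for NA here

# Pre-computed table: each byte value (0..255) maps to four genotype codes.
LOOKUP = np.zeros((256, 4), dtype=np.int8)
for b in range(256):
    for k in range(4):
        LOOKUP[b, k] = CODES[(b >> (2 * k)) & 0b11]

def unpack(packed_bytes, n_samples):
    # Unpack a byte string into n_samples genotype codes for one SNP,
    # dropping any padding in the final byte.
    geno = LOOKUP[np.frombuffer(packed_bytes, dtype=np.uint8)].ravel()
    return geno[:n_samples]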


Update-on-change. As mentioned earlier, the linear predictor l must be updated every time a weight β_j changes its value. This involves iterating over one N-vector (for the linear loss) or several N-vectors (pre-computed products of the l vector with the y vector, for the squared hinge loss). We perform such updates only when the jth variable is active (non-zero) and the estimate β̂_j has actually changed from the last iteration, saving unnecessary updates.

5.4.3. Convergence of the Algorithm

There are three types of convergence relevant to this algorithm. First is convergence of the Newton step for each variable j. Second is convergence of coordinate descent to the global minimum. Third is convergence of the active set.

Convergence of the Newton step. Newton's method is exact for the linear loss and the squared hinge loss, hence convergence of the Newton step is guaranteed and we do not check for it. For other losses, such as the logistic loss, Newton's method amounts to a quadratic approximation and convergence is not guaranteed; other techniques such as line search are sometimes used (these are not implemented in SparSNP).

Convergence of coordinate descent. Tseng (2001) showed that coordinate descent converges to the global minimum when two conditions are met: (i) the loss function being minimised is convex, as is the case with most common loss functions such as those used here, and (ii) the penalty is separable, that is, the penalty is a sum of functions, each a function of a single weight β_j. The l1 penalty is separable, and coordinate descent is therefore guaranteed to find the minimum when used with a convex loss function.

In practice, convergence is typically measured either by the absolute change in the loss between iterations, |L^{(k)} − L^{(k−1)}| ≤ ε, or by the relative change in loss, |L^{(k)} − L^{(k−1)}| / |L^{(k)}| ≤ ε. Convergence can also be measured in each weight β_j separately, namely |β_j^{(k)} − β_j^{(k−1)}| ≤ ε or |β_j^{(k)} − β_j^{(k−1)}| / |β_j^{(k)}| ≤ ε for absolute and relative convergence, respectively. SparSNP implements the test for absolute change in the loss, as we found it to offer a good trade-off between speed (number of iterations until convergence) and precision in the final estimated weights.

Convergence of the active set. SparSNP uses an active-set method to reduce computational cost, whereby we iterate over a small set of active (non-zero) variables instead of over all p input variables (see Section 5.4.2 for details). Convergence of the active set means that the active set has not changed: no variables have either entered or left the set between two iterations.


While this is similar to convergence in the weights as discussed previously, active-set convergence means that variables that were previously zero remain exactly zero and those that were non-zero remain non-zero, whereas convergence in the weights does not guarantee this property (unless the tolerance ε is zero). Therefore, we use a combination of convergence in the loss and convergence of the active set to determine convergence in SparSNP.

5.4.4. Model Selection

The λ penalty tunes the model complexity, and can be selected in several ways. The simplest way is to leave it fixed at some arbitrary value; however, this may result in suboptimal performance if the number of selected variables is too small or too large. A second way is to prespecify the number of non-zero SNPs required, and then search for the λ penalty that produces the required number of SNPs (Wu et al., 2009). A third way is to use cross-validation. Cross-validation may produce models with too many false positives (non-zero weights that should be zero) (Meinshausen and Bühlmann, 2006), and Meinshausen and Bühlmann (2010) advocate using resampling to overcome this problem. However, in practice, cross-validation works well for selecting the best model or set of models when the number of samples is large enough that there are enough samples in each training fold to reasonably estimate the model parameters, and when the class labels are not too imbalanced, as is the case with many case/control datasets. Since the estimate of AUC derived from choosing the best model in cross-validation may be upwardly biased, an unbiased estimate of predictive performance should be derived from an independent test set.

5.4.5. Space Complexity

At any given time, we need to store in memory the following data:

• the cache, representing k vectors of samples x_ij for i = 1, ..., N (where k is the cache size in terms of N-vectors);

• one vector of the linear predictor l_i for i = 1, ..., N;

• one vector of the coefficients β̂_j for j = 0, ..., p;

• for loss functions other than the squared loss, several auxiliary vectors representing transformations of the linear predictor, each of length N;

• two vectors representing whether each variable is/was active, of length p + 1.

In total, the memory requirements are O(N + p).


    active_j ← active′_j ← True, for j = 0, ..., p
    allconverged ← 0
    for k = 1, ..., kmax do
        for j = 0, ..., p do
            if active_j then
                read x_j from disk
                s_j ← (∂L/∂β_j) / (∂²L/∂β_j²)
                β̂_j^(k) ← S(β̂_j^(k−1) − s_j, λ) if j > 0, β̂_j^(k−1) − s_j otherwise
                ∆_j ← β̂_j^(k) − β̂_j^(k−1)
                l ← UpdateLP(l, x_j, ∆_j)
            end
            active_j ← (β̂_j^(k) ≠ 0)
        end
        // Check convergence in the loss L
        if |L^(k) − L^(k−1)| ≤ ε then
            allconverged ← allconverged + 1
        else
            allconverged ← 0
        end
        if allconverged = 1 then
            // Loss has converged once; record the current active set
            active′_j ← active_j for j = 0, ..., p
        else
            if AllEqual(active, active′) then
                // Active set has not changed between two consecutive convergences; terminate
                break
            else
                // Active set has changed since the last convergence;
                // reset the recorded active set and repeat for another epoch
                active′_j ← active_j for j = 0, ..., p
                allconverged ← 1
            end
        end
    end

Algorithm 1: The coordinate descent algorithm, showing the active-set method. The function AllEqual(⋅, ⋅) returns True when both input vectors are identical and False otherwise, All(⋅) returns True when all elements of the input vector evaluate to True, and I(⋅) is the indicator function. The variable β_0 is the intercept. kmax is the maximal number of epochs (a user-determined parameter). For the linear loss, UpdateLP(l, x_j, ∆_j) := l_i + ∆_j x_ij for i = 1, ..., N.


5.4.6. Discrimination and Explained Phenotypic Variance

We measure the discrimination of a classifier using the Area Under the ROC Curve (AUC, or AROC) (Hanley and McNeil, 1982), defined as

\widehat{\mathrm{AUC}} = \frac{1}{N_{+} N_{-}} \sum_{i=1}^{N_{+}} \sum_{j=1}^{N_{-}} \left[ I(\hat{y}_i > \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right],    (5.10)

where N_+ and N_- are the numbers of positive and negative labels (N_+ + N_- = N), ŷ_i is the prediction for the ith sample, and I(⋅) is the indicator function, I(x) = 1 when x is true and 0 otherwise. The sample AUC has a probabilistic interpretation as the (estimated) probability of ranking two randomly chosen samples in the correct order (i.e., a randomly chosen case ranked above a randomly chosen control), plus a correction for ties. AUC = 0.5 is equivalent to random ranking, whereas AUC = 1 and AUC = 0 correspond to perfect and perfectly-wrong ranking, respectively. Unlike the error rate (or, conversely, the accuracy), AUC does not depend on the class balance of the dataset, hence it can be meaningfully compared across different datasets.
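A brief NumPy sketch of the sample AUC in Eqn. 5.10 (an illustrative O(N_+ N_-) implementation; production code would typically use an equivalent rank-based formula instead):

import numpy as np

def sample_auc(scores, labels):
    # AUC per Eqn. 5.10; labels are 1 for cases and 0 for controls.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every case against every control, counting ties as half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))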


SparSNP estimates the proportion of phenotypic or genetic variance explained by a given genetic model. The phenotypic variance is the variation observed in the phenotype in the data, and it may be due to environmental as well as genetic factors, whereas the genetic variance is the variation due solely to genetic factors, also called the heritability of the phenotype, which is typically estimated from twin or pedigree studies where confounding environmental effects can be minimised. Hence, the explained phenotypic variance is the variance of the phenotype in the data explained by the model, and the explained genetic variance is the proportion of the heritability that can be explained by the model. In practical terms, the higher the proportion of explained phenotypic variance, the better the model is at explaining the data. However, if the explained genetic variance is low, then the model is likely not capturing all of the genetic variation that affects the phenotype. The details of the derivation are given in Wray et al. (2010); for convenience we repeat the main method here, following their notation.

The explained phenotypic variance h²_{L[x]} is on the liability scale, whereas AUC is on the 0-1 scale. On the liability scale, we assume that the underlying liability P (roughly interpreted as risk of disease) is distributed according to the standard normal distribution, P ∼ N(0, 1), and that the threshold T separates patients without the disease (liability P < T) from those with the disease (liability P > T). This model is also called the probit model (the Gaussian counterpart to the binomial logit model used in logistic regression). T is determined from the observed population prevalence K. Since the proportion of patients with liability P > T equals the prevalence, T = Φ^{-1}(1 − K), where Φ^{-1}(⋅) is the inverse of the standard normal cumulative distribution function (cdf).

The explained phenotypic variance h²_{L[x]} is estimated from the AUC and the prevalence K as

h^2_{L[x]} = 2Q^2 / \left[ (v - i)^2 + Q^2\, i (i - T) + v (v - T) \right],    (5.11)

where

i = \phi(T)/K, \quad v = -iK/(1 - K),

and Q = Φ^{-1}(ÂUC), where Φ^{-1}(⋅) is the inverse standard normal cdf and φ(⋅) is the standard normal density function (pdf).

The proportion of explained genetic variance ρ²_{GG} is then

\rho^2_{GG} = h^2_{L[x]} / h^2_L,    (5.12)

where h²_L is the (narrow-sense) heritability of the disease on the liability scale, which must have been estimated beforehand.
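A hedged Python sketch of this AUC-to-liability-scale conversion (Eqns. 5.11 and 5.12), using SciPy's standard normal distribution; it simply transcribes the formulas above, and the function name and arguments are illustrative.

from scipy.stats import norm

def explained_variance_liability(auc, prevalence, heritability=None):
    # Explained phenotypic variance on the liability scale (Eqn. 5.11) and,
    # if the liability-scale heritability h^2_L is supplied, the proportion
    # of genetic variance explained (Eqn. 5.12).
    K = prevalence
    T = norm.ppf(1 - K)            # liability threshold
    i = norm.pdf(T) / K            # i = phi(T)/K
    v = -i * K / (1 - K)           # v = -iK/(1 - K)
    Q = norm.ppf(auc)
    h2_lx = 2 * Q**2 / ((v - i)**2 + Q**2 * i * (i - T) + v * (v - T))
    if heritability is None:
        return h2_lx
    return h2_lx, h2_lx / heritability

# Example: explained_variance_liability(0.9, 0.01) for AUC 0.9 at 1% prevalence.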


5.4.7. Technical Limitations

Missing data. For convenience, SparSNP implements random imputation for missing genotypes, where missing genotypes are randomly replaced with a genotype {0, 1, 2} (with probability 1/3 each), repeatedly on each access. When the proportion of missingness is small and the genotypes are missing at random (for example, no differential missingness between cases and controls), such a simple approach does not substantially affect the predictive ability and does not introduce significant spurious associations. However, when missingness is high or differential between cases and controls, spurious associations can arise, and it is crucial either to use PLINK to filter SNPs and samples with high missingness, or alternatively to impute the missing data using a more sophisticated method such as BEAGLE (Browning and Browning, 2007), IMPUTE (Howie et al., 2009), or MACH (Li et al., 2010).

Confounding effects. SparSNP does not account for possible batch effects, which must be accounted for at the quality control stage. Nor does SparSNP currently account for confounders such as population stratification, admixture, or cryptic relatedness; EIGENSTRAT (Price et al., 2006) and PLINK can be used to detect these and to filter the data accordingly.

Applying models to new data. SparSNP produces text files containing the model weights for each SNP, and can be used in prediction mode to read these weights, together with another BED file, to produce predictions for other datasets. Model weights are with respect to the minor allele dosage in the training data, and the reference allele may be different in another dataset, possibly resulting in a reversal of the sign of the SNP effect. In addition, both the discovery and validation datasets must contain the same SNPs in the same ordering (marker names are not important). We recommend using PLINK to ensure that both the discovery and validation datasets contain the same SNPs and are encoded using the same reference alleles.

5.5. Results

To assess the performance of SparSNP and compare it with existing methods, we used a celiac disease case/control dataset (Dubois et al., 2010), consisting of N = 11,940 samples from five European populations (Italian, Finnish, two British, and Dutch), with p = 516,504 autosomal SNPs. The data processing and quality control were described in the original publication.

We used two different experimental setups to compare SparSNP with four other state-of-the-art methods: one setup for timing comparison and another for predictive comparison via cross-validation. For the timing comparison, we timed the process of fitting the model (over a grid of hyperparameters for glmnet and SparSNP, one hyperparameter for the rest). For the predictive comparison, we used cross-validation over a grid of hyperparameters.

We compared the following methods:

• SparSNP 0.87 [1], with l1-penalised squared hinge loss;

• glmnet 1.7 (Friedman et al., 2010) [2], with logistic loss (binomial family), running under R 2.12.2 (R Development Core Team, 2011);

• liblinear 1.8 (Fan et al., 2008) [3], with l1-regularised l2-loss support vector classification (model 5, equivalent to the l1-penalised squared hinge loss used by SparSNP);

• liblinear-cdblock (Yu et al., 2010) [4], with a (non-sparse) block l2 support vector machine (model 1);

• hyperlasso (Hoggart et al., 2008) [5], with the double exponential (DE) prior (equivalent to the lasso).

All models used the minor allele dosage {0, 1, 2} as input.


[Figure 5.1: panels for 50,000, 250,000, and 500,000 SNPs, showing time (seconds) against number of samples for glmnet, HyperLasso, LL-CD-L2, LL-L1, and SparSNP.]

Figure 5.1.: Time (in seconds) for model fitting, over sub-samples of the entire celiac disease dataset, taken as the minimum over 10 independent runs. (a) All methods including hyperlasso. (b) Excluding hyperlasso. For in-memory methods we included the time to read the binary data into R. For SparSNP and glmnet we used a λ grid of size 20, and a maximum model size of 2048 SNPs. liblinear used C = 1. hyperlasso used one iteration with λ = 1 (DE prior). The insets show the leftmost panel (50,000 SNPs) on its own scale to better visualise the differences.


5.5.1. SparSNP makes possible rapid, low-memory analysis of massive SNP datasets

SparSNP consistently outperformed the other methods when fitting models to the data (Figure 5.1). We ran all methods on random subsets of the celiac disease dataset, with p = {50,000, 250,000, 500,000} SNPs and N = {1000, 5000, 10,000} samples, a total of nine subsets. This process was independently repeated 10 times. Only SparSNP and liblinear-cdblock could fit models to datasets with > 5000 samples and 250,000 SNPs, and they were the only tools that could fit models to > 1000 samples and 500,000 SNPs on a machine with 32GiB RAM (running on a single CPU). It is important to note that the aforementioned data sizes would be considered quite small by current standards. Also note that, in contrast with SparSNP, liblinear-cdblock does not implement an l1-penalised model but a standard l2-penalised support vector machine (SVM), which is not a sparse model and does not produce solutions over a grid of model sizes; instead, a computationally expensive scheme such as recursive feature elimination (Guyon et al., 2002) would be required to find sparse models, but we did not use RFE here. Of the remaining methods, liblinear and glmnet did not complete all experiments, due to running out of memory (on a 32GiB RAM machine) or due to the data exceeding the limit on matrix sizes in R (a maximum of 2^31 − 1 elements). hyperlasso took much longer to complete: ∼2 hours for the 1000-sample/500,000-SNP subset and ∼69 hours for the 10,000-sample/500,000-SNP subset.

We emphasise that these results are for one run over the data. In practice, cross-validation is used to guide model selection and to evaluate the generalisation error of a model. Run times for cross-validation would be higher yet: 3-fold cross-validation repeated 10 times would take approximately 20 times longer, ∼22 and ∼4 hours for liblinear-cdblock and SparSNP, respectively, over the largest subset, making the differences in speed even more important. Also note the difference in the number of models fitted: both SparSNP and glmnet use a warm-restart strategy, computing a separate model for each penalty in a grid of 20 penalties, resulting in a path of 20 separate models with different sizes, whereas liblinear, liblinear-cdblock, and hyperlasso computed only one model based on one penalty; exploring a grid of penalties would be costlier still in terms of time.

SparSNP is implemented in C, glmnet is mainly implemented in Fortran, and liblinear, liblinear-cdblock, and hyperlasso are implemented in C++. Therefore, we assume that the implementation language is not a large factor in the speed differences.

[1] http://www.genomics.csse.unimelb.edu.au/SparSNP
[2] http://cran.r-project.org/web/packages/glmnet
[3] http://www.csie.ntu.edu.tw/~cjlin/liblinear
[4] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/cdblock
[5] http://www.ebi.ac.uk/projects/BARGEN/download/HyperLasso


[Figure 5.2: AUC and explained phenotypic variance ("VarExp") against the number of SNPs in the model (1 to 1024), for glmnet, HyperLasso, LL-CD-L2, LL-L1, and SparSNP.]

Figure 5.2.: LOESS-smoothed AUC and explained phenotypic variance (denoted "VarExp") for the Finnish celiac disease dataset, for increasing model sizes. For liblinear-cdblock (LL-CD-L2), all 516,504 SNPs are included in the model. AUC is estimated over 30 × 3-fold cross-validation. The explained phenotypic variance is estimated from the AUC using the method of Wray et al. (2010), assuming a population prevalence of K = 1%.

5.5.2. SparSNP produces models of better or comparable predictive ability

We used the Finnish subset of the celiac disease dataset (N = 2476 samples, p = 516,504 SNPs) to evaluate the predictive performance of the models in 3-fold cross-validation. Overall, the Finnish dataset had low missingness (no samples or SNPs with missingness ≥ 5%). To enable glmnet and liblinear to run on data of this size, we used a machine with 48GiB RAM. We measured predictive ability with the area under the receiver operating characteristic curve (AUC) (Hanley and McNeil, 1982). From the AUC we also estimated the explained proportion of phenotypic variance (Wray et al., 2010), assuming a population prevalence for celiac disease of K = 1%. We did not evaluate predictive ability over the entire celiac dataset, as it consists of several populations of different ethnic backgrounds (British, Italian, Finnish, and Dutch), and the case/control status may be confounded by effects such as population stratification.

SparSNP induced models with AUC of up to 0.9 and explained phenotypic variance of up to ∼40% (Figure 5.2), almost identical to glmnet, except for small differences at the extremes of the λ path; these differences may be due to the fact that SparSNP and glmnet use different loss functions and have different parameters such as convergence tolerances. liblinear showed a similar maximum AUC to the other methods, but much lower AUC for smaller numbers of SNPs in the model. liblinear-cdblock showed consistently lower AUC over the range of costs used: a grid of 18 costs C = 10^{-4}, ..., 10^{3}. Varying the costs did not substantially change the AUC (maximum changes of < 0.01 in AUC), therefore we show the results averaged over all costs.


Since liblinear-cdblock uses an l2-SVM, which does not induce sparse models and does not natively produce a range of model sizes, we show results for a model with all 516,504 SNPs.

Due to the high computational cost of running hyperlasso, we were not able to run as comprehensive a grid search; therefore, we performed only two replications of 3-fold cross-validation, using the DE prior with parameter λ = {10, 20, 30, 40, 50, 60} over 10 posterior modes, and averaged the AUC over the modes.

Importantly, while SparSNP achieved AUC better than or comparable to the other approaches, the resources consumed were far from equal: SparSNP performed 3-fold cross-validation using a total of about 1GiB of RAM, whereas liblinear required about 24GiB, and glmnet used up to 27GiB (the total number of samples used in the cross-validation training phase is ∼1650, or 2/3 of the total Finnish subset). Both liblinear-cdblock and hyperlasso used low amounts of memory: liblinear-cdblock used about 210MiB (using 50 disk-based blocks), and hyperlasso used a maximum of only 2GiB (roughly the size of the training data); however, hyperlasso was by far the slowest.

5.6. Software Features

An overview of the SparSNP analysis pipeline is shown in Figure 5.3. The software implementation of SparSNP includes the following features:

• implementation of l1-penalised linear regression for continuous traits and l1-penalised classification for binary traits;

• speed: SparSNP fits models to data with 10^4 samples and 5 × 10^5 SNPs in < 10 minutes on a single CPU;

• small (and tunable) memory requirements: ∼1GiB for the datasets analysed here;

• compatibility with PLINK BED (SNP-major ordering) and FAM files (single phenotype) (Purcell et al., 2007);

• cross-validation performed natively, removing the need to manually split datasets;

• production of a set of models with increasing numbers of SNPs in each model, allowing model selection based on cross-validated predictive performance;

• calculation of the area under the receiver-operating-characteristic curve (AUC) and the explained phenotypic or genetic variance, in cross-validation or on replication datasets.


[Figure 5.3: flowchart of an example SparSNP analysis pipeline with the following stages.]

Acquire data: discovery dataset and validation dataset (if available).

Imputation: IMPUTE/MACH/Beagle.

Quality control: filter SNPs by posterior probability calls, missingness, MAF, and HWE; filter samples by missingness; test for differential missingness; test for other confounders (population stratification, two-locus test).

SparSNP discovery: cross-validation on the discovery data; plot AUC and variance explained; choose a good model or models.

Validation: check reference allele agreement between the discovery and validation datasets; map discovery SNPs to validation SNPs if on different platforms; predict on the validation data; compute AUC and variance explained in the validation data.

Figure 5.3.: An example pipeline for analysing a SNP discovery dataset with SparSNP and testing the model on a validation dataset. Most of the data preparation and processing can be done with PLINK.


5.7. Discussion

We have introduced our tool SparSNP, for fitting lasso-penalised linear models to large SNP datasets. In experiments using a celiac disease dataset, we have shown that SparSNP is faster than four other state-of-the-art methods while using small amounts of memory, and achieves comparable or better predictive ability. The main bottleneck in such analyses is the large amount of RAM required to fit models, which may not be feasible or accessible to many users; SparSNP incorporates multiple computational strategies to minimise the amount of RAM required. Even when such memory is available, the time taken to read the data from disk becomes the bottleneck, rather than the fitting process itself. Thus, the time taken to analyse the data may be long enough to preclude a comprehensive analysis, such as multiple rounds of cross-validation or experimenting with various model parameters. SparSNP makes it possible to rapidly analyse such datasets: 10 replications of 3-fold cross-validation of a 10,000-sample/500,000-SNP dataset can be performed in about 2 hours, requiring only ∼1GiB RAM. This time can be further reduced by running multiple instances in parallel on a compute cluster. While the celiac disease dataset analysed here is quite large, recent genome-wide studies are larger still, involving 1-6 million SNPs, either by direct assay or by imputation from HapMap (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2007) or 1000 Genomes (1000 Genomes Project Consortium, 2010). The number of samples in current datasets is larger as well, and is likely to continue growing into the hundreds of thousands. For such studies, fitting multivariable models is not feasible with standard tools. SparSNP is scalable in terms of memory requirements, and yet is faster than comparable approaches, making it suitable for analysing such datasets.

All of the l1-penalised methods considered here induced models with consistently better predictive ability than the l2-penalised SVM implemented in liblinear-cdblock. This indicates that the l1-penalised approach is preferable both in terms of prediction and in terms of interpretability, as many model weights are set to zero, in contrast to the l2 methods, where typically none of the weights are exactly zero and additional postprocessing is required to extract the subset of important variables. The success of l1 methods may also suggest that the underlying genetic architecture of celiac disease is indeed sparse: very few of the assayed SNPs are strongly associated with the phenotype. Nonetheless, we cannot assume that the strongly associated SNPs are truly causal, as they are mostly tag SNPs that are in LD with the causal SNPs. Better detection of causal SNPs may be achieved using fine-mapping data such as the Immunochip for immune-related diseases (Trynka et al., 2011).

There are several ways in which the basic SparSNP approach can be expanded.
First, the genetic models implemented in SparSNP could be further expanded to include dominant and recessive models. This could be implemented as extra variables, in addition to the additive coding variables, such that each SNP would be represented by three feature vectors that could be selected by the lasso independently, based on their contribution to the model.


Another, computationally cheaper, scheme would be to allow these extended models to apply only when a SNP already has a non-zero weight in the additive model. Second, the simple imputation implemented in SparSNP is for convenience only, when missingness is sufficiently low. Sophisticated imputation methods would necessarily come at the cost of computational efficiency; imputing by allele frequency may be a good compromise, as it is less biased than imputing by fixed probabilities but is computationally cheap. Third, other variables such as clinical variables (sex, age) and population structure variables (principal components) could conceptually be added to the model, thus allowing adjustment for potential confounders and making SparSNP more useful for practical analyses where such confounding is common.


6. Sparse Linear Models Explain Phenotypic Variation and Predict Risk of Complex Disease

6.1. Introduction

In Chapter 4 we examined the problem of predicting the phenotype of breast cancer relapse from gene expression data. In this chapter, we again consider the task of phenotype prediction, but from genetic data. In contrast with gene expression, which is dynamic, genetic variation is fixed for the life of the individual. In addition, typical genetic datasets are much larger than most gene expression datasets, consisting of thousands of individuals and hundreds of thousands to millions of SNPs (single nucleotide polymorphisms), compared with typical gene expression datasets of several hundred individuals and up to tens of thousands of probes. Therefore, we apply the sparse linear models discussed in Chapter 5, which are suited to fitting models to data of this scale. The value of predictive models is threefold. First, good predictive models of disease may enable better diagnostic tools for detecting individuals at higher risk of disease, enabling early intervention or treatment. Second, analysing the predictive ability of the SNPs allows us to better quantify the genetic component of disease, relative to other factors such as the environment. Third, characterising the most predictive SNPs provides information about potential causal mechanisms of disease and about genetic regulation in general.

To maximise predictive value and identify causal SNPs, all SNPs should be modelled simultaneously in a multivariable model.


We present a comprehensive analysis of simulated and real data using lasso-penalised multivariable models. In simulation, our multivariable models achieved lower false-positive rates than univariable methods for detecting causal SNPs. Using genome-wide SNP profiles for 32,000 individuals across eight complex diseases, we found that our models accurately discriminated cases from controls in celiac disease and type 1 diabetes. For these diseases, the models replicated strongly across independent datasets, with validation Area Under the receiver operating characteristic Curve (AUC) of 0.84 for type 1 diabetes and 0.82-0.9 for celiac disease, the latter across four independent datasets of different European ethnicities. Consequently, the models of celiac disease and type 1 diabetes explained substantial phenotypic variance in independent validation: 22% for type 1 diabetes and 21-38% for celiac disease. Investigation of type 1 diabetes and celiac disease substructure revealed highly predictive subtypes that achieve ≥99% specificity and, in some cases, positive predictive values ≥0.80. Taken together, this study shows that supervised learning approaches can address missing phenotypic variance and reliably predict the incidence of celiac disease and type 1 diabetes from genotype.

In this chapter, we aim to comprehensively assess the performance of lasso-penalised models in SNP association analysis and to investigate their implications in the population context. First, in simulation, we investigate how well the lasso recovers true causal SNPs, as compared with univariable testing. In contrast to many existing simulation setups, we argue for the use of precision rather than sensitivity (power) in measuring detection ability. Second, we apply lasso models to two celiac disease datasets and seven Wellcome Trust Case-Control Consortium (WTCCC) (The Wellcome Trust Case Control Consortium, 2007) datasets (bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, and type 1 and type 2 diabetes). We evaluate the predictive ability of these models in cross-validation, and for celiac disease and type 1 diabetes also in independent validation datasets. We further examine the positive and negative predictive values produced by these models, taking into account the population prevalence of each disease, and finally identify subgroups of celiac disease and type 1 diabetes cases that can be predicted with high confidence, potentially indicating previously unknown disease substructure.

6.2. Methods

We now describe the statistical models we use to model the association between genotypes and the phenotype.

6.2.1. Genetic Models

We considered three methods for fitting statistical models to the SNP data: l1-penalised squared hinge loss, two-stage logistic regression, and GCTA (Yang et al., 2011).


Lasso Models

We used l1-penalised squared hinge loss models, implemented in the package SparSNP (as described in Chapter 5), over a grid of 20 penalties, to induce models with increasing numbers of SNPs. All models were linear in the minor allele dosage {0, 1, 2}. All models were evaluated using cross-validation and, when a validation dataset was available, also on the validation dataset. For the validation of models, we selected the number of SNPs in the model that yielded the highest cross-validated AUC (see Section B.1) on the discovery dataset. We then applied these models without any further modification or tuning to the independent validation dataset, to derive an unbiased estimate of the models' AUC.

Logistic Regression

For the logistic regression, we first use univariable logistic regression on each SNP separately, yielding p separate models,

\operatorname{logit}(p_i^{(j)}) = \beta_0^{(j)} + x_i^{(j)} \beta_1^{(j)}, \quad j = 1, ..., p,

where logit(p) = log(p/(1 − p)), p_i^{(j)} is the probability of disease for the ith sample based on the jth SNP, x_i^{(j)} is the ith genotype for the jth SNP, and β_0^{(j)} and β_1^{(j)} are the intercept and regression coefficient for the jth SNP, respectively. The logistic model is fitted using a variant of iteratively-reweighted least squares (IRLS, equivalent to Newton's method) (Hastie et al., 2009a). The p-value for the association is derived from the z-statistic for each coefficient (Wald's test (Agresti, 2002)), where the z-statistic itself is derived by inverting the Hessian matrix evaluated at the maximum likelihood solution. In the second stage, we filter the SNPs based on their p-values, and fit a multivariable logistic model to all k remaining SNPs,

\operatorname{logit}(p_i) = \beta_0' + \sum_{j=1}^{k} x_{ij}' \beta_j',

where x_i' ∈ R^k represents the ith vector of genotypes for the k SNPs that remained after the filtering, β' ∈ R^k is the corresponding k-vector of model weights, and β_0' is the intercept. Prior to model fitting, we further filter the SNPs based on their pairwise correlation, r² < 0.8 (a generally accepted threshold for high LD; see for example Carlson et al. (2004); Hinds et al. (2005)), in order to reduce the effects of multicollinearity, as IRLS may fail for highly correlated inputs (due to a singular Hessian matrix).
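A hedged sketch of this two-stage fitting procedure using statsmodels (illustrative only; the p-value threshold, the LD-pruning scheme, and the variable names here are assumptions and not necessarily those used in the actual analysis):

import numpy as np
import statsmodels.api as sm

def two_stage_logistic(X, y, p_threshold=1e-6, r2_threshold=0.8):
    # X: N x p minor-allele dosage matrix, y: binary phenotype (0/1).
    n, p = X.shape
    # Stage 1: univariable logistic regression per SNP, keeping Wald p-values.
    pvals = np.ones(p)
    for j in range(p):
        fit = sm.Logit(y, sm.add_constant(X[:, j])).fit(disp=0)
        pvals[j] = fit.pvalues[1]
    keep = np.where(pvals < p_threshold)[0]
    # Prune SNPs in high LD (pairwise r^2 >= 0.8) among those retained.
    pruned = []
    for j in keep:
        if all(np.corrcoef(X[:, j], X[:, k])[0, 1] ** 2 < r2_threshold
               for k in pruned):
            pruned.append(j)
    # Stage 2: multivariable logistic model on the retained SNPs.
    final_fit = sm.Logit(y, sm.add_constant(X[:, pruned])).fit(disp=0)
    return pruned, final_fit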


Finally, we use the multivariable model to predict the probability of case status from the genotype and the estimated model weights,

\Pr(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\left(-\beta_0' - \sum_{j=1}^{k} x_{ij}' \beta_j'\right)}.

GCTA

Briefly, Genome-wide Complex Trait Analysis (GCTA) (Yang et al., 2011) implements a mixed-effect model (Gelman and Hill, 2007), where the SNPs are considered random effects and other variables such as sex and age are fixed effects. Fixed effects are effects that are constant among samples with the same value of the variable. For example, using sex as a fixed effect in a regression of height on sex (and excluding any other effects), all males have the same predicted height. In contrast, random effects are effects that vary randomly between samples with the same realisation of the variable. For example, when regressing height on the city in which a person lives, and using the city as a random effect, we allow heights to vary within each city, but they may still be more correlated within each city than between cities, depending on the model. Modelling the city as a random effect allows us to account for such correlations in a generic way, without having to fit a specific regression term for each city that happened to occur in the sample. Typically, discrete variables modelled as fixed effects are those for which all possible values can be measured in the study (for example, male/female for sex), whereas variables are more suitable to be modelled as random effects if they represent a sample of all possible values the variable can take (for example, a sample of cities out of all possible cities in Australia).

For a continuous outcome y, GCTA implements the mixed-effect model

y_i = \beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j + g_i + \epsilon_i, \quad i = 1, ..., N,    (6.1)

where y ∈ R^N are the phenotypes, β ∈ R^p are weights for variables such as sex, age, and other clinical measurements of interest, and g ∈ R^N is a vector of normally-distributed genetic effects with g ∼ N(0, A σ²_g) and var(y) = A σ²_g + I σ²_ε, where A is the N × N genetic relationship matrix (GRM) between individuals, I is the N × N identity matrix, and σ²_g and σ²_ε are the variance explained by the SNPs and the variance of the residuals, respectively. The GRM is estimated from the SNP data. For binary outcomes (such as case/control data), a model similar to (6.1) is used, except that it is linear on the log-odds (logit) scale. From the GCTA mixed model we then estimate the risk of each sample being a case based on its genotype (the best linear unbiased predictor, BLUP), and this score is then used to rank the samples in the subsequent analysis.
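As an illustration of how a genetic relationship matrix might be computed from genotypes, here is a hedged NumPy sketch following the commonly used form A = ZZ^T/p, where Z holds column-standardised allele dosages; this is a simplified stand-in, not GCTA's exact implementation (which, for example, handles missing genotypes and per-SNP weighting explicitly), and it assumes no monomorphic SNPs.

import numpy as np

def genetic_relationship_matrix(X):
    # X: N x p matrix of allele dosages {0, 1, 2}, no missing values.
    freq = X.mean(axis=0) / 2.0                        # estimated allele frequencies
    Z = (X - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    return Z @ Z.T / X.shape[1]                        # N x N GRM, A = ZZ'/p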


6.2.2. HAPGEN2 simulations

We used HAPGEN (Su et al., 2011) v2.2.0 to generate simulated case/control data with linkage disequilibrium (LD) patterns based on haplotype and legend data from HapMap3 (International HapMap Consortium, 2005, 2007) release 22 CEU data of chromosome 10 (73,832 SNPs). In order to reduce memory requirements, we split the chromosome into 148 blocks of up to 500 SNPs each, randomly selected one SNP from each block as causal, and combined the blocks together to form a complete chromosome. We used a multiplicative model of risk, where each causal SNP was randomly assigned one risk ratio per dose out of {1.1, 1.2, 1.3, 1.4, 1.5, 2.0}. For example, with a risk ratio per dose of 1.1, the risk ratios for the homozygous genotype (two protective alleles), the heterozygous genotype, and the homozygous genotype (two risk alleles) are 1.0, 1.1, and 1.21, respectively. We conducted eight sets of simulations, each using a different number of samples N = {100, 500, 2500, 5000, 10,000, 20,000, 50,000, 100,000}, where the number of cases and controls is balanced (N/2 for each class). In considering the prediction of each SNP as causal/non-causal, only the 148 causal SNPs were taken to be true positives, with the rest considered false positives.

We fit l_1-penalised squared-hinge loss models to each dataset, over a grid of l_1 penalties. For each penalty, the absolute value of the estimated SNP weights \beta_j was thresholded at different cutoffs to decide which SNP was causal (above the cutoff) or non-causal (below the cutoff). As a baseline for comparison, we used univariable genotypic (allele dosage model) logistic regression (one SNP at a time); we used the negative log_10 of the Wald-test p-value to rank the SNPs from most likely to be associated to least likely. We also used the estimated regression coefficient from the logistic regression and the log_10 of the p-value from the 1-df allelic test, with very similar results to the logistic test (results not shown).

Note that the univariable logistic model was used only to detect SNPs with significant associations with the case/control phenotype, and was therefore employed in the simulations for assessing causal SNP recovery. The multivariable logistic model was used to create case/control predictive models for the WTCCC and celiac disease data, based on the SNPs identified by the univariable method, and was not assessed in the simulations.

6.2.3. Positive and negative predictive values

The positive and negative predictive values (PPV and NPV) (Altman and Bland, 1994) of a model are estimated as

    \mathrm{PPV} = \frac{\mathrm{sens} \times \mathrm{prev}}{\mathrm{sens} \times \mathrm{prev} + (1 - \mathrm{spec}) \times (1 - \mathrm{prev})}
    \quad \text{and} \quad
    \mathrm{NPV} = \frac{\mathrm{spec} \times (1 - \mathrm{prev})}{\mathrm{spec} \times (1 - \mathrm{prev}) + (1 - \mathrm{sens}) \times \mathrm{prev}},

where “sens” is the sensitivity = TP/(TP + FN), “spec” is the specificity = TN/(FP + TN), TP are the true positives, FN are the false negatives, FP are the false positives, TN are the true negatives, and “prev” is the prevalence in the population, in the range [0, 1].
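A direct transcription of these two formulas into R might look as follows (an illustrative helper, not the thesis code; the inputs are assumed to be cross-validated sensitivity and specificity and an assumed population prevalence):

```r
## PPV and NPV at an assumed population prevalence, from the formulas above.
ppv.npv <- function(sens, spec, prev) {
  ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
  npv <- spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
  c(PPV = ppv, NPV = npv)
}

## e.g. a classifier with 80% sensitivity and 99% specificity at 1% prevalence:
## ppv.npv(0.80, 0.99, 0.01)   # PPV is only ~0.45 despite the high specificity
```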


We estimated the PPV and NPV in cross-validation, where the confusion matrix (TP, FP, TN, FN) is derived from the predicted case/control class and the actual class in the test data. The prediction of case/control status is a binarisation of the classifier’s continuous output, the linear predictor l_i = \hat\beta_0 + \sum_{j=1}^{p} x_{ij} \hat\beta_j, where x_{ij} is the ith observation for the jth SNP in the test data, \hat\beta_j is the estimated coefficient for the jth SNP, and \hat\beta_0 is the intercept. Samples with a linear predictor score above a given cutoff are classified as cases, whereas those below are classified as controls. By varying the cutoff, different pairs of ⟨sensitivity, specificity⟩ are achieved, thus inducing a curve of ⟨PPV, NPV⟩ pairs. For this reason, the range of PPV and NPV is limited by the data — the discreteness of finite-size samples prevents very low or very high PPV and NPV values. We averaged the PPV in small bins of NPV, over the cross-validation replications, in order to reduce the variability in the estimated PPV.

6.2.4. Genomic Inflation Factor

The genomic inflation factor (Devlin and Roeder, 1999) for a given SNP j is defined as

    \lambda^{(j)} = Y_A^{2(j)} / Y^{2(j)},    (6.2)

where

    Y^{2(j)} = \frac{N\left[N(r_1 + 2r_2) - R(n_1 + 2n_2)\right]^2}{R(N - R)\left[N(n_1 + 4n_2) - (n_1 + 2n_2)^2\right]}    (6.3)

is the test statistic for the Cochran-Armitage trend test and

    Y_A^{2(j)} = \frac{2N\left[2N(r_1 + 2r_2) - 2R(n_1 + 2n_2)\right]^2}{(2R)\left(2(N - R)\right)\left[2N(n_1 + 2n_2) - (n_1 + 2n_2)^2\right]}    (6.4)

is the \chi^2 allelic test for association with the phenotype. The overall inflation factor \lambda is typically estimated from the p per-SNP factors as

    \lambda = \mathrm{median}(\lambda^{(1)}, \ldots, \lambda^{(p)}).    (6.5)

A value of \lambda > 1 indicates the presence of non-random mating in the population represented in the data, which can be caused by population stratification (samples of multiple ethnic backgrounds) or cryptic relatedness (unaccounted-for familial relationships), both of which can confound typical case/control analyses.
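Equations (6.2)–(6.5) can be transcribed directly into R. The sketch below is illustrative only and assumes the usual case/control notation, with r_1 and r_2 the numbers of cases carrying one and two copies of the allele, n_1 and n_2 the corresponding counts over all samples, R the total number of cases, and N the total sample size; `counts` is a hypothetical data frame with one row per SNP.

```r
## Cochran-Armitage trend statistic, equation (6.3)
trend.stat <- function(r1, r2, n1, n2, R, N) {
  N * (N * (r1 + 2 * r2) - R * (n1 + 2 * n2))^2 /
    (R * (N - R) * (N * (n1 + 4 * n2) - (n1 + 2 * n2)^2))
}

## Allelic chi-squared statistic, equation (6.4)
allelic.stat <- function(r1, r2, n1, n2, R, N) {
  2 * N * (2 * N * (r1 + 2 * r2) - 2 * R * (n1 + 2 * n2))^2 /
    ((2 * R) * (2 * (N - R)) * (2 * N * (n1 + 2 * n2) - (n1 + 2 * n2)^2))
}

## Per-SNP inflation factors (6.2) and their median (6.5)
lambda.gc <- function(counts, R, N) {
  lam <- with(counts, allelic.stat(r1, r2, n1, n2, R, N) /
                      trend.stat(r1, r2, n1, n2, R, N))
  median(lam)
}
```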


6.2.5. Data and quality control

The Bipolar Disorder (BD), Coronary Artery Disease (CAD), Crohn’s Disease / Irritable Bowel Syndrome (Crohn’s), Hypertension (HT), Rheumatoid Arthritis (RA), Type 1 Diabetes (WTCCC-T1D), and Type 2 Diabetes (T2D) datasets were obtained from the WTCCC (The Wellcome Trust Case Control Consortium, 2007), in addition to the 1958 birth cohort dataset (58C) and the National Blood Service dataset (NBS), which served as shared controls. We used the default Chiamo genotype calls generated by the WTCCC. We removed samples that were excluded by the WTCCC due to being highly related or duplicated, SNPs that were in the WTCCC exclusion list, and SNPs that were visually identified by the WTCCC as having bad cluster plots. There were 459,012 remaining autosomal SNPs in each dataset (Table 6.1). We obtained the GoKinD-T1D case-only dataset from NIH dbGaP (accession no. phs000018.v1.p1). For the GoKinD-T1D dataset, we removed A/T and G/C SNPs to minimise strand mismatches between that dataset and the WTCCC datasets, and removed SNPs with MAF < 0.05, missing observations > 0.01, and Hardy-Weinberg equilibrium p-value < 10^{-6}. Samples were removed if they had missing phenotypes, or were missing > 0.01 of the genotypes, leaving a total of 1604 cases over 265,023 autosomal SNPs. The GoKinD-T1D dataset was matched with the NBS control dataset to form a complete case/control dataset.

We used two versions of each of the two celiac disease datasets, Celiac1 (van Heel et al., 2007) and the UK subset of the Celiac2 (Dubois et al., 2010) dataset (Celiac2-UK). The first version was the original data as published, with 301,689 and 516,504 autosomal SNPs, respectively. The second was a stringently-filtered dataset, in which SNPs were removed if they had MAF ≤ 0.01, missingness ≥ 0.05, deviation from Hardy-Weinberg equilibrium in controls p ≤ 0.05, differential missingness between cases and controls p ≤ 0.05, and two-locus test (Lee et al., 2010) p ≤ 0.05. Samples were removed if they had missingness ≥ 0.01, as were both samples in each pair with identity-by-descent (IBD) \hat\pi ≥ 0.05 (a low level of relatedness, about half as related as first cousins, see for example (Browning and Browning, 2011; Lee et al., 2011)). We removed both samples rather than one for the sake of consistency with (Lee et al., 2010). For the stringently-filtered datasets, there were 2109 samples and 279,312 autosomal SNPs for the Celiac1 dataset and 6613 samples and 471,191 autosomal SNPs for the Celiac2-UK dataset. Similarly, we also used a stringently filtered version of WTCCC-T1D that underwent the same filtering, leaving 4901 samples and 370,280 SNPs.

6.3. Results

We compared the lasso squared-hinge loss model with logistic regression in simulation, and with logistic regression and GCTA in analysis of the nine complex disease datasets. The main results are shown in this chapter; other supporting results are included in Appendix B.


6.3.1. Recovery of Causal SNPs in Simulation

To assess squared-hinge loss lasso model performance, we used HAPGEN2 (Marchini et al., 2007; Su et al., 2011) to simulate various case/control genotype datasets where the causal SNPs were known (see Section 6.2.2). Unlike some other published simulations (Ayers and Cordell, 2010), we define a true positive only as detecting one of the causal SNPs specified by HAPGEN2; any other SNP is taken to be a false positive, regardless of LD or distance. With real SNP arrays the concept of a “causal SNP” can be unclear, as the assayed SNPs are mostly tag SNPs and the causal SNP itself may not have been assayed. In contrast, here the causal SNP is always present, and our aim is to assess how well different statistical methods differentiate signal (true causal SNPs) from noise (non-causal SNPs in LD with the causal SNP). We summarised the results of the HAPGEN2 simulations using two measures: the Area Under the Receiver Operating Characteristic Curve (AUC) (Hanley and McNeil, 1982) and the Area under the Precision-Recall Curve (APRC). While the AUC is commonly used for evaluating binary classification performance, it can be misleading when comparing classifiers where one class vastly outnumbers the other, since a classifier with even a tiny false positive rate will incur a large absolute number of false positives, but this failing will not necessarily be reflected in the AUC (Supplementary Section B.1). In contrast, APRC is more sensitive to false positives, and is more suitable when generating biological hypotheses from imbalanced data.

The lasso method was able to detect causal SNPs with fewer false positives than univariable logistic regression (Figure 6.1). As expected, APRC increased for both methods with sample size and with increasing risk ratios. Overall, APRC for the lasso increases much faster, finally achieving APRC = 0.8 for a risk ratio of 2.0, compared with APRC of ∼ 0.3 for the univariable method. In contrast, AUC results (Supplementary Figure B.2) were consistently higher for univariable logistic regression for small and medium sample sizes, and roughly equivalent for a sample size of 100,000. These results indicate that whereas the univariable method is better at finding all causal SNPs, it does so by introducing a large number of false positives, due to high LD, manifesting in substantially lower APRC. In contrast, especially for smaller model sizes, the lasso may have less sensitivity to identify causal SNPs than univariable logistic regression, but the SNPs found are far less likely to be false positives (higher specificity).
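For illustration, the APRC summary can be computed from any per-SNP ranking statistic, whether the lasso |β_j| or the univariable −log_10 p-value. The sketch below (not the thesis code) uses hypothetical inputs `score` (the ranking statistic) and `causal` (a logical vector marking the 148 simulated causal SNPs), and approximates the area under the precision-recall curve with the trapezoidal rule.

```r
## Precision-recall summary of causal-SNP recovery for a given ranking.
pr.area <- function(score, causal) {
  ord <- order(score, decreasing = TRUE)   # rank SNPs from strongest to weakest
  tp <- cumsum(causal[ord])                # true positives among the top-k SNPs
  precision <- tp / seq_along(tp)
  recall <- tp / sum(causal)
  ## area under the precision-recall curve (trapezoidal rule)
  sum(diff(recall) * (head(precision, -1) + tail(precision, -1)) / 2)
}
```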
6.3.2. Modelling genome-wide profiles for eight complex diseases

We applied lasso models to nine discovery datasets (Table 6.1). Seven discovery datasets were from the WTCCC (The Wellcome Trust Case Control Consortium, 2007) — Bipolar Disorder (BD), Coronary Artery Disease (CAD), Crohn’s Disease/Irritable Bowel Syndrome (Crohn), Hypertension (HT), Rheumatoid Arthritis (RA), Type 1 Diabetes (WTCCC-T1D), and Type 2 Diabetes (T2D). Two additional discovery sets were celiac disease datasets (Dubois et al., 2010; van Heel et al., 2007), denoted here Celiac1 and Celiac2-UK (samples of UK descent only).


Figure 6.1.: APRC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true “causal” SNPs in the data.


Disease                                      Abbrev.      Cases  Controls  Autosomal SNPs  Platform         Reference
Bipolar Disease                              BD            1868      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Coronary Artery Disease                      CAD           1926      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Hypertension                                 HT            1952      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Rheumatoid Arthritis                         RA            1860      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Crohn’s Disease / Irritable Bowel Syndrome   Crohn         1748      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Type 1 Diabetes                              WTCCC-T1D     1963      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Type 2 Diabetes                              T2D           1924      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Celiac Disease                               Celiac1        778      1422         301,689  Illumina †       (van Heel et al., 2007)
Celiac Disease                               Celiac2-UK    1849      4936         516,504  Illumina †       (Dubois et al., 2010)

Table 6.1.: List of discovery datasets used in this analysis. The 1958 British Birth Cohort (N = 1480) and the National Blood Service (N = 1458) datasets were used as shared controls for all WTCCC datasets. † Celiac1 used Illumina HumanHap33v1-1 for cases and HumanHap550-2v3 for controls, and Celiac2-UK used Illumina 670-QuadCustom-v1 for cases and Illumina 1.2M-DuoCustom-v1 for controls.

We also used three validation sets for celiac disease — the Finnish (Celiac2-Finn), Italian (Celiac2-IT), and Dutch (Celiac2-NL) cohorts from (Dubois et al., 2010) — and one validation set for T1D — the GAIN GoKinD (Mueller et al., 2006; Pezzolesi et al., 2009) T1D dataset (GoKinD-T1D) (Table 6.2). For controls, all the WTCCC datasets used both the 1958 birth cohort (58C) and National Blood Service (NBS) datasets (The Wellcome Trust Case Control Consortium, 2007) as shared controls. We paired the GoKinD-T1D dataset with the NBS control dataset. To prevent including the same controls in the T1D discovery and validation datasets, we also used a version of the WTCCC-T1D dataset that only included the WTCCC-T1D cases and 58C controls, and this reduced version was tested on the GoKinD-T1D data.


Disease           Abbrev.        Cases  Controls  Autosomal SNPs  Platform         Reference
Celiac Disease    Celiac2-IT       497       543         516,504  Illumina †       (Dubois et al., 2010)
Celiac Disease    Celiac2-NL       803       846         516,504  Illumina †       (Dubois et al., 2010)
Celiac Disease    Celiac2-Finn     647      1829         516,504  Illumina †       (Dubois et al., 2010)
Type 1 Diabetes   GoKinD-T1D      1604      1458         265,023  Affymetrix 500K  (Mueller et al., 2006; Pezzolesi et al., 2009)

Table 6.2.: List of independent replication datasets used. The National Blood Service (N = 1458) dataset was used as controls for the GoKinD-T1D dataset. † Celiac2-IT and Celiac2-NL used Illumina 670-QuadCustom-v1 for cases and controls; Celiac2-Finn used 670-QuadCustom-v1 for cases and 610-Quad for controls.

6.3.3. Assessment of confounding factors

While population stratification, batch effects, or data missingness can create spurious associations between the genotypes and case/control status, there was no evidence of confounding by these factors in the data. We estimated the genomic inflation factors of each discovery dataset, which are based on the median deviation of the per-SNP test statistics from that expected under the assumption that most SNPs are not truly associated with the phenotype. An inflation factor substantially larger than 1 corresponds to a larger than expected number of associated SNPs, potentially due to population stratification (Devlin and Roeder, 1999). The genomic inflation factors are shown in Table B.2, indicating low to non-existent levels of deviation. We also used principal component analysis (Price et al., 2006) (PCA, Supplementary Section B.2.1) to evaluate whether there was strong structure in the data, unaccounted for by LD (Figure 6.4). The genomic inflation factors were 1.0 for the WTCCC data and between 1.051–1.056 for the celiac discovery datasets, both using the logistic regression test, without accounting for LD. Applying PCA to the original celiac disease datasets, we found substantial structure, and several principal components were highly predictive of the case/control status (AUC = 0.8, Figure B.5). Regions of strong LD are known to cause artifacts in the analysis (Patterson et al., 2006). Therefore, we removed known high-LD regions (Fellay et al., 2009) (chr5: 44Mb–51.5Mb, chr6: 25Mb–33.5Mb, chr8: 8Mb–12Mb, and chr11: 45Mb–57Mb). In addition, we thinned the remaining SNPs for LD using PLINK, and in the PCA regressed each SNP on the previous five SNPs. After accounting for these strong LD regions, we found no evidence of strong structure in the Celiac1 dataset, and the principal components were not predictive of case/control status (AUC < 0.54, Figure B.5). We also plotted the PC loadings of the SNPs in the Celiac1 dataset, aggregated separately in each chromosome (Figure B.4).


As expected, chromosomes 5, 6, 8, and 11 have unusually high loadings in the original data. In the LD-pruned data, the loadings are uniform across the chromosomes, indicating that the high-LD regions above are indeed the main contributors to the variation found by PCA, and that removing LD removes the structure found in PCA. These results, together with the replication in the independent datasets (see below), strongly indicate that population structure, batch effects, and data missingness were not significant confounders for our predictive models.

Confounding by non-population effects such as batch effects and differential missingness is not a significant issue in the discovery datasets of celiac disease and T1D. Such effects are known to introduce spurious associations between the genotypes and the case/control status, artificially inflating the apparent explained variance and predictive power (Yang et al., 2011). To assess the impact of these effects, we generated a version of the Celiac1 and Celiac2-UK datasets that underwent more stringent filtering than the original data, with the aim of reducing the effect of spurious associations between SNPs and the case/control status (Methods). Overall, we saw only minor reductions in cross-validated AUC for these filtered datasets. For the stringently filtered celiac disease datasets, the cross-validated AUC for the lasso models peaked at ∼0.87 for both datasets (Figure B.6). The results for stringently-filtered WTCCC-T1D were largely unchanged as well (AUC ∼0.87), indicating that the predictive ability in celiac disease and T1D is likely due to true genetic variation rather than spurious effects.

6.3.4. Discrimination of the phenotype in cross-validation

We trained lasso squared-hinge loss models on each dataset, and evaluated their discrimination and stability for all directly-genotyped, autosomal SNPs. We evaluated discrimination of the phenotype using AUC on 20 × 3-fold cross-validation (20 × 3CV) for each dataset. Figure 6.2a shows the AUC achieved by varying the number of SNPs in the lasso model for all datasets, using all autosomal SNPs. The number of SNPs at which AUC either peaks or plateaus gives a rough estimate of the number of causal SNPs in each dataset. We can group the datasets based on their AUC. The first group includes WTCCC-T1D and Celiac1/Celiac2-UK, both achieving a maximum AUC of ∼ 0.88. The second group includes RA and Crohn, achieving AUC of up to 0.70–0.74. The third group includes the rest of the datasets (BD, CAD, HT, and T2D), achieving AUC no higher than 0.65.
The lasso models achieved equivalent and sometimes substantially better AUC in predicting case/control status, compared with two other approaches: multivariable logistic regression on SNPs chosen by univariable statistics, and risk scores produced by multivariable mixed-effects linear models in GCTA (Yang et al., 2011) (Appendix B.2.4). Based on the models’ AUC and estimates of heritability and prevalence, we estimated the proportion of phenotypic variance explained by the models on the liability scale (Wray et al., 2010) (Figure 6.2b).
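The style of cross-validated evaluation described in Section 6.3.4 can be sketched in R with glmnet, used here as a stand-in for the squared-hinge loss models fitted by SparSNP (glmnet fits an l_1-penalised logistic regression instead). The inputs `geno` and `y` are hypothetical, and a single 3-fold split is shown; repeating over 20 random splits and averaging would mirror the 20 × 3CV scheme.

```r
## Cross-validated AUC over a grid of penalties, i.e. over increasing
## numbers of SNPs with non-zero weights.
library(glmnet)
cv <- cv.glmnet(geno, y, family = "binomial", type.measure = "auc",
                nfolds = 3, nlambda = 20)
## cv$cvm  : cross-validated AUC at each penalty
## cv$nzero: number of SNPs with non-zero weights at each penalty
```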


Figure 6.2.: (a) Area under the receiver operating characteristic curve (AUC) for models of the 9 case/control datasets. Results are LOESS-smoothed over 20 × 3-fold cross-validation. See the Supplementary Results for details on each disease. (b) LOESS-smoothed proportion of phenotypic variance explained for the lasso models for the 9 discovery datasets, using the method of Wray et al. (2010).


           Celiac2-Finn       Celiac2-IT        Celiac2-NL        Celiac2-UK
           AUC     VarExp     AUC     VarExp    AUC     VarExp    AUC     VarExp
Mean       0.870   0.300      0.824   0.214     0.850   0.258     0.846   0.251
95% LCL    0.869   0.298      0.822   0.210     0.849   0.257     0.844   0.249
95% UCL    0.871   0.302      0.826   0.217     0.850   0.260     0.847   0.254

Table 6.3.: AUC and explained phenotypic variance for independent validation datasets of celiac disease models trained on Celiac1. We used models with ∼200 SNPs in the model, trained in cross-validation on Celiac1 and tested on subsets of the Celiac2 dataset. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 1%.

We did not consider genetic variance, as this depends on estimates of heritability that may vary substantially, although, given a robust estimate of heritability, explained genetic variance can be easily obtained from the explained phenotypic variance. For lasso models in cross-validation, the top two datasets in terms of explained phenotypic variance were Celiac1/Celiac2-UK (∼ 32%) and WTCCC-T1D (∼ 28%). Models of RA explained up to ∼ 10% of the variance, while the rest of the datasets achieved 5% or less.

Independent replication of celiac disease and T1D models

Models of celiac disease developed in the two discovery datasets (Celiac1 and Celiac2-UK) strongly replicated in three independent validation datasets without any further tuning. Based on cross-validation results in Celiac1, we selected models with 200 non-zero SNPs, and used them to predict case/control status in the Celiac2-UK, Celiac2-IT, Celiac2-Finn, and Celiac2-NL datasets. To avoid any optimisation bias, we did not tune the model further on these datasets. Despite the different ancestries, and the different microarray platforms, the lasso models trained on the Celiac1 dataset and tested on the four other datasets showed AUC ranging from 0.824 for Celiac2-IT to 0.87 for Celiac2-Finn, with corresponding explained phenotypic variance ranging from 21.4% for Celiac2-IT to 30% for Celiac2-Finn (Table 6.3), showing that the predictive power is strongly retained in independent replication. Similarly, we trained models in cross-validation on the Celiac2-UK subset, again choosing all models with ∼ 200 non-zero SNPs as indicated by cross-validation within that dataset, and tested them on the three other Celiac2 subsets (Table 6.4), resulting in AUC of between 0.857 for Celiac2-IT and 0.901 for Celiac2-Finn, with corresponding explained phenotypic variance of between 27.3% for Celiac2-IT and 37.5% for Celiac2-Finn (assuming population prevalence K = 1%), again showing strong agreement between datasets in terms of celiac disease risk prediction regardless of ethnic background.


Figure 6.3.: Lasso models can achieve high positive predictive values. PPV versus NPV for the lasso models of the 9 discovery datasets. Results are averaged over 20 × 3CV. See the Supplementary Results for the number of SNPs with non-zero coefficients in each dataset. Note that the curves do not span the entire range of NPV since not all sensitivity and specificity values can be observed in a finite dataset.


           Celiac2-Finn       Celiac2-IT        Celiac2-NL
           AUC     VarExp     AUC     VarExp    AUC     VarExp
Mean       0.901   0.375      0.857   0.273     0.873   0.307
95% LCL    0.900   0.373      0.856   0.270     0.873   0.306
95% UCL    0.901   0.377      0.858   0.275     0.874   0.308

Table 6.4.: AUC and explained phenotypic variance for independent validation datasets of celiac disease models trained on Celiac2-UK. Models were trained in cross-validation on the UK subset of the Celiac2 datasets, and tested on the other three subsets of the Celiac2 dataset. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 1%.

           Cross-validation: WTCCC-T1D    Independent validation: GoKinD-T1D
           AUC       VarExp               AUC       VarExp
Mean       0.842     0.219                0.842     0.217
95% LCL    0.840     0.214                0.832     0.201
95% UCL    0.850     0.223                0.852     0.233

Table 6.5.: Models were trained in cross-validation on the WTCCC-T1D dataset and tested on the GoKinD-T1D dataset, using ∼ 100 SNPs in the model. The 95% confidence interval is derived from the LOESS fit. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 0.54%.


T1D models trained on the WTCCC-T1D dataset strongly replicated in the GoKinD-T1D dataset (Table 6.5). We trained models in 20 × 3-fold cross-validation on the WTCCC-T1D dataset (excluding the NBS control dataset, which was paired with GoKinD-T1D), and selected models with about 100 SNPs, based on the fact that AUC was maximised at that number of SNPs in the model (AUC = 0.843). We then applied all models with ∼ 100 SNPs, without any further tuning, to the GoKinD-T1D dataset (AUC = 0.842), with corresponding explained phenotypic variance of 22% in both datasets (assuming population prevalence K = 0.54%).

6.3.5. Genetic models in a population context

Statistical models of complex disease risk are usually trained on case/control datasets where the proportion of cases is much higher than the background prevalence in the general population. For example, the population prevalence of celiac disease is 1% (van Heel and West, 2006), whereas in the datasets used here the proportion of cases is ∼ 30%. Therefore, while a model may be able to accurately classify patients in our data, it may not have high enough specificity to be useful in a population context. To evaluate the precision of our models given the estimated population prevalence, we estimated the positive and negative predictive values (PPV and NPV) (Altman and Bland, 1994). PPV can be interpreted as the probability of having disease given a positive diagnosis, and NPV is the probability of not having disease given a negative diagnosis. A perfect model should achieve PPV = 1 and NPV = 1. PPV and NPV were estimated over a range of cutoffs of the classifier’s predictions, inducing a curve of all PPV/NPV value pairs, within cross-validation (Figure 6.3). For consistency, we used the same population prevalence estimates as in Wray et al. (2010) (Supplementary Table B.3).

For all diseases here, except the relatively common hypertension (assumed K = 13.1%), NPV was higher than 0.94 and can be assumed to always be high regardless of the predictive model used, since even a model producing random predictions will achieve high NPV when the prevalence is low — blindly predicting “no disease” regardless of genotype will be correct most of the time. In contrast, PPV here depends largely on the model’s predictive ability and less on the prevalence. Across the genetic models generated, PPV was non-uniform across samples (Figure 6.3) — for the majority of the samples it was low (PPV < 0.2), however for a small subset of samples the genetic models achieved both high positive and negative prediction with NPV > 0.94 and PPV > 0.85. Stringent quality control measures as well as external validation using the Celiac1/Celiac2-UK datasets did not change the results (Supplementary Figures B.9, B.10, B.11, B.12, B.13). The fact that high PPV was limited to a small number of samples shows that while the models of celiac disease and type 1 diabetes can discriminate cases from controls in the data examined here, these models are not suitable for population-wide screening as they would generate far too many false positives.
The models may be better targeted at screening sub-populations known to be at higher risk.
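For illustration, the PPV/NPV curve of Figure 6.3 can be traced out by sweeping cutoffs over the linear predictor and converting the resulting sensitivity/specificity pairs at an assumed prevalence. This is a rough sketch only; `lp` (linear predictor scores on test data), `y` (true 0/1 labels), and `prev` are hypothetical inputs, and `ppv.npv()` is the helper sketched earlier in Section 6.2.3.

```r
## Trace a PPV/NPV curve by varying the classification cutoff.
ppv.npv.curve <- function(lp, y, prev, n.cut = 100) {
  cuts <- quantile(lp, probs = seq(0.01, 0.99, length.out = n.cut))
  t(sapply(cuts, function(cut) {
    pred <- as.numeric(lp > cut)                    # 1 = predicted case
    sens <- sum(pred == 1 & y == 1) / sum(y == 1)
    spec <- sum(pred == 0 & y == 0) / sum(y == 0)
    ppv.npv(sens, spec, prev)                       # convert at the assumed prevalence
  }))
}
```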


[Figure 6.4: principal component plots of (a) the Celiac1 and (b) the stringently-filtered Celiac1 datasets, with samples marked by the specificity (> 0.99 or lower) at which they were classified as cases.]


Cases identified at 0.99 specificity or more are those that have been classified with high confidence as cases. For Celiac1 and Celiac2-UK, the high-specificity cases are highly concentrated in one of the three clusters uncovered by PCA. For T1D, the high-specificity samples can be stratified by PC4, however they are spread across the clusters in PC5. Note that in contrast to using PCA for assessing unknown population structure in the data, here we did not filter out high-LD regions or thin by LD, since some of these regions, such as the MHC on chr6, are highly associated with the phenotype, and excluding them would remove much of the variation we are interested in finding. However, to address concerns of spurious case structure due to effects such as differential missingness, we repeated the same procedure for the stringently-filtered version of Celiac1, with similar results (Supplementary Figure 6.4).

6.4. Discussion

We have described an analysis of the performance of lasso-penalised multivariable models that simultaneously model all SNPs in large-scale datasets, and have shown that such models can plausibly be used for disease prediction from SNP data. The models developed were robust to different disease architectures and strongly replicated across independent disease datasets, in both celiac disease and type 1 diabetes, without any further tuning required. In addition, the top SNPs identified in the Celiac1/Celiac2-UK and WTCCC-T1D datasets showed high stability across the cross-validation replications (Supplementary Section B.3).

Across all seven WTCCC and two celiac disease datasets, the lasso method achieved discrimination equal to or higher than both a combined screening/multivariable method and a multivariable mixed-effects linear model. Other possible approaches include logistic regression with forward or backward elimination of SNPs (Ayers and Cordell, 2010); however, these methods can be highly unstable, leading to different solutions based on the order in which the SNPs were added or removed, and getting stuck in local optima (Hastie et al., 2009a).

For celiac disease, T1D, and to a lesser extent RA, the major histocompatibility complex (MHC) region of chr6 contains most of the predictive ability from the SNPs we have considered (Supplementary Figures B.20, B.21, B.25, and B.24). This has also been observed by others (Wei et al., 2009). Therefore, for prediction of some autoimmune diseases, a reasonable and cost-effective approach is to focus on the MHC. The discriminatory power of our method on T1D and Crohn’s disease was similar to or better than that reported by others (Kooperberg et al., 2010; Roshan et al., 2011; Wei et al., 2009). Some of these studies used lasso models as well; the small differences may be partially explained by the fact that they pre-screened the SNPs whereas we did not. We have shown that univariable pre-screening of the SNPs can result in reduced AUC compared with simultaneously modelling all SNPs, with the difference especially apparent in the Crohn’s disease dataset (Supplementary Figure B.22).


When investigating case substructure and accounting for population prevalence, we have shown that our genetic models can identify and target particular subsets of cases with high predictive ability. Within the data, a genetically distinct subset of ∼11% of celiac disease cases and ∼13% of T1D cases could be predicted with > 99% specificity; however, the low disease prevalence in the population requires even higher specificity if population-wide screening is to accurately capture the whole subset. While the absolute numbers of high-PPV subsets were not large in the collections assessed here (N ∼ 5–11), extrapolating the size of genetically predictable disease in a population shows these genetic models may have non-trivial potential for predicting disease while maintaining a low number of false positives. For example, given a disease with 1% prevalence in a population of 100 million and a genetic model that can predict 5 out of 2000 cases from a random affected sample, it can be estimated that ∼2,500 cases could be highly confidently predicted as disease positive (1% of 100 million gives 1 million cases, and 5/2000 = 0.25% of these is ∼2,500). Further, since these subsets contain profiles which are highly genetically differentiated, we hypothesise that these cases represent particular subtype(s) of disease which appear to have either a greater genetic basis or one that is better captured by common SNPs than the remaining cases.

6.5. Conclusions

Our analysis of eight human diseases has produced a spectrum of models with different predictive abilities: models with high predictive ability (T1D and celiac disease), medium (RA and Crohn’s disease), and low (the rest), with substantially differing numbers of SNPs in the models, ranging from < 100 SNPs for T1D and celiac disease to > 2000 SNPs for T2D, CAD, and BD. These results, together with the case substructure in T1D and celiac disease, suggest the genetic architecture of complex disease may be more heterogeneous than previously thought, both between diseases and within the same disease. More work is needed to better characterise these subtypes and to develop genetic models that can better predict the remaining cases. It remains to be seen whether genotypic risk prediction in such subpopulations will have increased predictive power over traditional risk factors such as family history. Based on AUC and estimates of heritability, our models of celiac disease and T1D explained substantial proportions of the phenotypic variance. Importantly, the results for celiac disease and T1D indicate that models which explain only part of the phenotypic variance can still have substantial predictive value. Taken together, our results indicate that the amount of missing variation in human complex disease, either phenotypic or genetic, may currently be overestimated, and that supervised learning approaches can be used to address this issue.
With the more complete profiles of genomic variation being generated by high-throughput sequencing, genetic models of human disease will only increase in predictive power, bringing the promise of genomics closer to clinical application.


From a computational and statistical perspective, this work demonstrates the utility of multivariable statistical models in genetic risk prediction, fitted to all SNPs rather than just a list of pre-filtered candidates. Specifically, we have shown that lasso-penalised models fitted to all SNPs are both practical and useful in the genetic setting, and that they achieve predictive ability equivalent to or better than models built on top of pre-filtered candidates.


7. Characterising the Genetic Control of Human Metabolic Genes

In previous chapters we explored the use of gene expression data for prediction of breast cancer metastasis, and of genetic data for predicting complex disease status. However, considering the gene expression and genetic aspects in isolation provides only narrow insight into the underlying biological mechanisms of disease. An integrated analysis of several data types can potentially provide a better understanding of the hidden relationships between cellular components and biological processes, and of the links between these processes and the observed clinical phenotypes. In this chapter, we perform an integrative analysis of SNPs, gene expression, and metabonomic data from a human population cohort, based on the sparse linear models discussed earlier.

7.1. Introduction

The availability of gene expression, genetic, and metabonomic datasets has allowed integrated analyses of the relationships between genetic variation and gene expression (Mackay et al., 2009; Stranger et al., 2007) and between genetic variation and metabolites such as blood lipids (Ferrara et al., 2008; Inouye et al., 2010a,b; Surakka et al., 2011; Teslovich et al., 2010; Tukiainen et al., 2011). These datasets can be further integrated with high-level clinical phenotypes such as occurrence of disease (Holmes et al., 2008; Nicholson and Lindon, 2008). While such studies have produced valuable insights into the regulatory mechanisms of gene expression and metabolism, many of these analyses have mainly been concerned with detecting expression quantitative trait loci (eQTL) of genes (discussed in Section 2.4.8) or QTL of metabolites, but not with linking all three aspects together into putative pathways.


Inouye et al. (2010a) did perform one such integrated analysis for a specific set of genes, considering the lipid leukocyte (LL) gene module, and inferring the effects of SNPs on module expression and its interactions with serum metabolite levels. The LL module consisted of 11 genes and was derived from clustering of gene expression for the top 10% of genes (3520 genes) associated with seven lipid measurements. Next, cis- and trans-QTLs associated with this module were detected. Finally, structural equation modelling was used to infer causal networks of lipids and LL genes. Later, Inouye et al. (2010b) expanded this analysis to all metabolites in the data, identifying associations of the LL module with levels of lipoproteins, lipids, glycoproteins, isoleucine, and 3-hydroxybutyrate, and inferring the causal structure of these associations. These findings further highlighted the relationship between mechanisms of immunity and metabolism, suggesting causal mechanisms for this link. In this chapter, we expand the analysis of the DILGOM dataset, employing lasso linear models to detect gene–metabolite and gene–SNP associations in the data, covering all assayed metabolites (136 metabolites), genes (35,419 genes), and autosomal SNPs (544,538 SNPs).

As with SNP analyses of case/control phenotypes, many studies utilise the univariable approach, where each SNP is individually tested for association with each gene expression level, using a statistical hypothesis test to assess statistical significance, and possibly applying a multiple testing correction in order to control the false positive rate. We have shown in Chapter 6 that the univariable testing approach can result in more false positive associations than multivariable models that consider all SNPs together. In contrast with the univariable hypothesis-testing paradigm, we approach the problem from a predictive perspective, using lasso-penalised multivariable models for the two tasks of (i) feature selection — identifying gene expression associated with metabolite levels, or SNPs associated with gene expression levels — and (ii) modelling of the effects themselves. Rather than finding predictors that are statistically significant, we focus on finding subsets that are highly predictive, in the sense of explaining the most variation in the phenotype. Our aim here is to perform an unbiased analysis of all assayed genes and SNPs, rather than candidate sets, with the aim of deriving new associations between SNPs, genes, and metabolites.
Our sub-goals include:

• to estimate how much of the variation in metabolite levels can be explained by genes, and in turn, how much of the variation in gene expression can be explained by SNPs;

• to compare the degree of genetic control of genes associated with the metabolome with that of non-associated genes;

• to generate plausible hypotheses of the mechanisms of genetic regulation of metabolite levels, as mediated by gene expression levels;


• to link the genetic, gene expression, and metabolic data back to clinically relevant phenotypes related to disease.

We make several simplifying assumptions in our analyses. First, we assume that genetic factors (SNPs) are always “upstream” of gene expression and metabolites, in the sense that SNPs can be causal factors of gene expression and metabolites, but not the other way around. This assumption is based on the current understanding of molecular biology, in that SNPs are largely invariant throughout the life of an organism, and it underlies genome-wide association studies (GWAS) and Mendelian genetics in general. This causal effect of a SNP may not be direct but can be mediated by other factors, unless the SNP is a cis-QTL, in which case we assume that the effect is direct. In contrast, the question of causality in associations between gene expression and metabolite levels is less clear, as both factors can potentially be causal to each other. However, we make the simplifying assumption that the genes mediate between SNPs and metabolite levels, an assumption likely to hold for at least some genes, as metabolites are not coded for in the DNA and therefore SNP effects on them, if they exist, must be indirectly mediated through gene expression. While feedback loops (metabolites affecting genes) are known to occur in regulatory and metabolic networks (Alon, 2007), these are more difficult to model, and require time series data, RNA knockdown, or gene knockout data in order to infer the causal structure.

Our analytical pipeline is outlined in Figure 7.1. Briefly, we begin by lasso regression of each metabolite on all genes and all clinical variables. We then select the metabolites that were predicted with the highest R². For these metabolites, we then keep the genes selected in each model. In turn, we regress each gene on age and gender using unpenalised linear regression, and keep the residuals from each gene. Next, we regress the residuals for each gene on all SNPs, to detect QTLs affecting gene expression. Finally, we combine the results in a network analysis to derive hypotheses of genetic regulation of metabolites, focusing on metabolites associated with fasting glucose levels, a key determinant of type 2 diabetes.

7.2. Methods

Here we describe the dataset used in this study and the lasso-penalised models used to detect the associations.

7.2.1. Data

The Dietary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome (DILGOM) dataset contains 509 human samples in total (234 males, 275 females) in common between the gene expression, SNP, and metabolite data. The individuals were randomly sampled from the Finnish population (Inouye et al., 2010b). The data include:


[Figure 7.1: flowchart of the analysis pipeline, with metabolites, genes, clinical variables, and SNPs as inputs: remove clinical effects from metabolites; regress each metabolite on all genes; retain the predictable metabolites and the stable genes; remove the effects of gender and age from the genes; regress each gene on all SNPs; retain the predictable genes and the stable SNPs; infer causal networks.]

Figure 7.1.: Schematic diagram of our analysis pipeline.


• Concentrations of 136 serum metabolites and derived quantities, such as lipids, amino acids, and glucose levels, measured using 1H nuclear magnetic resonance (NMR) spectroscopy (Ala-Korpela, 2008). The derived quantities are not metabolites per se, but ratios of other metabolites that have previously been found to be biologically meaningful. For convenience, we refer to all the measurements as “metabolites”. Missing values were imputed to the median for each metabolite. Zero values were set to the smallest non-zero value for each metabolite, to prevent missing values later. The metabolites were transformed using the Box-Cox transformation to achieve approximate normality and homoscedasticity (the transformation parameter λ was automatically determined using the Guerrero method (Guerrero, 1993) in BoxCox.lambda in the R package forecast). Note that while the Box-Cox transformation can make the data more approximately normally distributed, thus reducing the effects of outliers and stabilising the variance, it changes the scale of the data, so that the transformed data are not on the original measurement scale.

• Whole blood gene expression of over 35,419 probes, from the Illumina HT-12 Expression BeadChip, on the log scale. The data were normalised using quantile normalisation (Bolstad et al., 2003).

• 544,538 autosomal SNPs, using the Illumina 610-quad SNP array. The quality control procedure is detailed in Inouye et al. (2010b). Missing values were randomly imputed within the SparSNP analysis.

• 21 clinical variables, including gender, age, BMI, weight, height, waist circumference, CRP (C-reactive protein) levels, insulin levels, cholesterol- and hypertension-lowering medication, smoking, alcohol intake, blood pressure, and others. Missing values were imputed to the median for each variable.

7.2.2. Predictive Modelling

We employed linear models to model metabolites as functions of gene expression, and gene expression as functions of genetic variation. The linear model is formulated as

    y_i = \beta_0 + x_i^T \beta + \epsilon_i, \quad i = 1, \ldots, N,    (7.1)

where y_i \in \mathbb{R} and x_i \in \mathbb{R}^p are the ith output and input, respectively, \beta_0 \in \mathbb{R} and \beta \in \mathbb{R}^p are the intercept and the weights, respectively, \epsilon_i \sim N(0, \sigma^2) is iid Gaussian noise, N is the number of samples, and p is the number of model weights. For unpenalised models, the model is fitted using maximum likelihood (least squares). For the lasso-penalised models, the model is fitted using penalised maximum likelihood, as implemented in the R package glmnet and the tool SparSNP (discussed in Chapter 5).
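For one metabolite, the preprocessing described above followed by a lasso fit of model (7.1) on the gene expression matrix can be sketched in R as below. This is a rough illustration, not the analysis code; `metab` (a metabolite vector) and `expr` (an N × p expression matrix) are hypothetical inputs.

```r
## Preprocess one metabolite and regress it on all gene expression probes.
library(forecast)   # BoxCox.lambda() and BoxCox()
library(glmnet)

metab[is.na(metab)] <- median(metab, na.rm = TRUE)       # impute to the median
metab[metab == 0]  <- min(metab[metab > 0])              # replace zero values
lambda.bc <- BoxCox.lambda(metab, method = "guerrero")   # Guerrero (1993)
y <- BoxCox(metab, lambda.bc)                            # transformed metabolite

cv <- cv.glmnet(expr, y, family = "gaussian")            # penalty chosen by CV
coef(cv, s = "lambda.min")                               # genes with non-zero weights
```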


Model Evaluation

We measure the predictive ability of a given model using the R², defined as

    R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2},    (7.2)

where y_i, \hat{y}_i, and \bar{y} are the ith output, the ith predicted output, and the average output \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i, respectively. Note that under this definition R² ∈ (−∞, 1]. A negative R² indicates that the model is worse than the model that includes the intercept only, usually caused by overfitting. R² is a function of the mean squared error (MSE), but unlike the MSE it is comparable across outputs with different variance. Although R² is not strictly a proportion, we can take R² to be the explained proportion of the variance by truncating negative R² to zero, as a negative R² means that the model is not explaining any variation.

Models of the Metabolites

We considered two model classes for the metabolites, using different inputs:

1. Gene expression together with clinical variables
2. Gene expression after clinical variable effects have been removed

The clinical variables were included in order to minimise potential confounding caused by differences in gene expression between groups such as different genders and different ages, and to assess how much of the remaining variation in metabolite levels can be explained by gene expression, above the variation explained by other clinically important variables such as insulin levels and CRP (C-reactive protein) levels. These models are not equivalent due to the lasso penalisation: in model 1, both genes and clinical variables can be selected with non-zero weights, and the resulting model will potentially contain both. In contrast, in model 2, we first remove the additive component of the effect of the clinical variables using unpenalised linear regression, and then model the residual variation in the metabolite using a lasso-penalised model of gene expression.

Thus, model 1 is useful for assessing the relative importance of each variable for predicting the metabolite, but the R² is due to both clinical variables and genes, and therefore does not reflect how much of the variation is due to each group of variables alone. The aim of model 2 is to estimate the partial contribution of gene expression to predicting the metabolite, while still accounting for the effects of the clinical variables. The R² for model 2 is equivalent to the semi-partial R² (Cohen et al., 2003), which can be interpreted informally as the explained proportion of variance remaining after removing the metabolite variation attributable to the clinical variables.

In evaluating the models’ predictive ability, there is the possibility of overfitting, which is the situation where the predictive ability in the training data is substantially higher than in the test data, essentially due to the model fitting the noise rather than the signal. We reduce the degree of overfitting by penalisation, chosen by cross-validation, and by using independent testing data for estimating the R².
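Equation (7.2), with the truncation to zero used when interpreting R² as a proportion of variance explained, can be written as a small R helper (an illustrative sketch only):

```r
## R^2 as in equation (7.2); negative values are truncated to zero when
## interpreting R^2 as the explained proportion of variance.
r.squared <- function(y, yhat, truncate = TRUE) {
  r2 <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
  if (truncate) max(r2, 0) else r2
}
```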


To estimate the R² for models 1 and 2 we used nested 10 × 10 × 10-fold cross-validation (Ambroise and McLachlan, 2002): split the data into 10 folds, run 10-fold cross-validation in each of the 10 training folds to select the penalty (from the model with the highest R²), train a model on the entire training data using the chosen penalty, and test this model on the unseen test fold. This process is repeated 10 times. The final R² values are averaged over the test set R² in the 10 replications. For model 2, we used nested cross-validation but with a two-stage model, first removing the effect of the clinical variables, as follows:

1. Split the data into training and testing folds.
2. In the training data, we regressed each metabolite on the clinical variables using unpenalised linear regression.
3. We ran the lasso on the training data within 10-fold cross-validation, over a grid of penalties λ_max, ..., λ_min, using the gene expression data as input and the residuals from the clinical variables as output. The best model was chosen as the model with the highest R².
4. We applied the model of clinical variables to the 10% independent test data to derive a prediction of the metabolite levels, and the residuals for the test data.
5. We ran the lasso on the entire 90% training set using the optimal penalty λ*, and tested the model on the residuals of the 10% test data.
6. Repeated steps 1–5 ten times.
7. The final R² values were averaged over the test set R² in the multiple replications.

Models of Gene Expression

We used lasso-penalised linear models in the tool SparSNP (Chapter 5) to model gene expression as a function of the SNPs (expression QTL). To estimate the R² for regressing the genes on the SNPs, we:

1. Regressed each gene on gender and age, using an unpenalised linear model.
2. Obtained the residuals for each gene.
3. Regressed the residuals for each gene on all SNPs, using lasso-penalised linear models, repeated within 30 × 10-fold cross-validation.

We then ranked the SNPs selected by the lasso method over the cross-validation replications, based on the proportion of replications in which they were selected.
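Steps 1–3 of the gene expression modelling can be sketched in R as follows, with glmnet shown as a stand-in for SparSNP. This is illustrative only; `expr.g` (one gene's expression vector), `covar` (a data frame with gender and age), and `geno` (the N × p SNP dosage matrix) are hypothetical inputs.

```r
## Residualise a gene on gender and age, then lasso on all SNPs (eQTL model).
library(glmnet)
resid.g <- resid(lm(expr.g ~ gender + age, data = covar))   # steps 1-2
cv <- cv.glmnet(geno, resid.g, family = "gaussian")         # step 3, one replication
selected <- which(as.vector(coef(cv, s = "lambda.min"))[-1] != 0)  # SNPs with non-zero weights

## Repeating this over many cross-validation replications and tabulating how
## often each SNP is selected gives the stability ranking described above.
```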


7.2.3. Causal Network Inference

Based on the predictive genes for the metabolites and the predictive SNPs for the genes, we inferred association networks of genes and SNPs for each metabolite. We refer to the process of determining the causal direction of each edge as orienting the edge. The edges between the SNPs and the genes are directed (causal), as we assume that SNPs can affect gene expression but gene expression cannot induce genetic variation.

In contrast, the edges between the genes and the metabolites cannot be oriented from the observed associations alone — we cannot determine whether gene expression regulates metabolite levels or whether it is the other way around without external information. Such external information is provided by cis-QTLs, assumed to be causal anchors (Aten et al., 2008; Schadt et al., 2005): direct causal regulators of gene expression for the genes they affect (Veyrieras et al., 2008), which can be used to orient the edges for the rest of the graph. Based on this assumption, any observed association between a gene and its cis-QTL is the result of a direct causal effect and not due to confounding with other genes or SNPs. We also assume that each cis-QTL can directly affect only one gene. In our analysis we considered a SNP associated with a gene to be a cis-QTL if it resides within a 1Mb-wide window of the probe's centre (Stranger et al., 2005). We define a SNP associated with a gene to be a trans-QTL if it is on another chromosome or is not a cis-QTL.

There are five possible graphs describing the causal relationships between one SNP (not necessarily a cis-QTL), one gene, and one metabolite:

1. SNP → gene → metabolite
2. SNP → metabolite → gene
3. SNP → gene ← metabolite
4. SNP → metabolite ← gene
5. gene ← SNP → metabolite

We assume that all other graphs, involving an edge into the SNP, are not possible, as the SNP is always a causal factor. When a model shows that the SNP is a causal factor of the metabolite, this does not imply a direct effect; it can be an indirect effect, for example through other genes. In addition to assuming that SNPs are causal (edges out of SNPs, not into SNPs), by restricting our analysis to SNPs that are cis-QTLs we can remove models 2 and 4, since they imply an effect of a SNP on the metabolite but not on the gene, contradicting our assumption of a direct effect on the gene.

Each regulatory model is characterised by a specific pattern of marginal and conditional independences (Table 7.1). Under the assumptions of approximate normality of the variables and linearity of the effects, these independences translate to marginal correlations and partial correlations, respectively (Whittaker, 1990).


Model (1) S → G → M: no marginal independences; conditionally, S ⊥ M | G; cor(S,G) ≠ 0, cor(S,M) ≠ 0, cor(M,G) ≠ 0; pcor(S,G) ≠ 0, pcor(S,M) = 0, pcor(M,G) ≠ 0.
Model (2) S → M → G: no marginal independences; conditionally, S ⊥ G | M; cor(S,G) ≠ 0, cor(S,M) ≠ 0, cor(M,G) ≠ 0; pcor(S,G) = 0, pcor(S,M) ≠ 0, pcor(M,G) ≠ 0.
Model (3) S → G ← M: marginally, S ⊥ M; no conditional independences; cor(S,M) = 0, cor(S,G) ≠ 0, cor(M,G) ≠ 0; pcor(S,M) ≠ 0, pcor(S,G) ≠ 0, pcor(M,G) ≠ 0.
Model (4) S → M ← G: marginally, S ⊥ G; no conditional independences; cor(S,G) = 0, cor(S,M) ≠ 0, cor(M,G) ≠ 0; pcor(S,G) ≠ 0, pcor(S,M) ≠ 0, pcor(M,G) ≠ 0.
Model (5) G ← S → M: no marginal independences; conditionally, G ⊥ M | S; cor(G,M) ≠ 0, cor(G,S) ≠ 0, cor(M,S) ≠ 0; pcor(G,M) = 0, pcor(G,S) ≠ 0, pcor(M,S) ≠ 0.

Table 7.1.: The marginal and conditional independence statements that can be derived from the (SNP, gene, metabolite) graph, and the corresponding correlations and partial correlations.

Therefore, based on the pattern of marginal and partial correlations we can discriminate between the three models of regulation (Figure 7.2). We estimated the 3 × 3 partial correlation matrix Π of each (SNP, gene, metabolite) triplet by the Moore-Penrose pseudoinverse of the correlation matrix Σ (obtained from the singular value decomposition of Σ), denoted Σ^†, and then normalised the entries such that

    \Pi_{ij} = \begin{cases} \dfrac{-\Sigma^{\dagger}_{ij}}{\sqrt{\Sigma^{\dagger}_{ii}\,\Sigma^{\dagger}_{jj}}} & \text{if } i \neq j \\ 1 & \text{otherwise.} \end{cases}    (7.3)

Since the correlations and partial correlations are observed in noisy data, we must use a measure of statistical significance to decide whether they are significantly different from zero. We used Fisher's z-transformation (Fisher, 1921, 1924) to transform the correlation (or partial correlation) r to achieve approximate normality,

    F(r) = \frac{1}{2}\sqrt{N - 3}\,\log\left(\frac{1 + r}{1 - r}\right),    (7.4)

which is approximately standard normal under the null hypothesis of zero correlation.
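As a concrete illustration, the partial correlation matrix of Eq. 7.3 and the Fisher test of Eq. 7.4 can be computed with NumPy and SciPy as in the minimal sketch below (not the thesis code); the triplet is assumed to be an N × 3 array with columns (SNP, gene, metabolite).

    import numpy as np
    from scipy import stats

    def partial_correlations(X):
        """X is an N x 3 array with columns (SNP, gene, metabolite)."""
        sigma = np.corrcoef(X, rowvar=False)   # 3 x 3 correlation matrix
        theta = np.linalg.pinv(sigma)          # Moore-Penrose pseudoinverse (computed via SVD)
        d = np.sqrt(np.outer(np.diag(theta), np.diag(theta)))
        pcor = -theta / d                      # Pi_ij = -Theta_ij / sqrt(Theta_ii * Theta_jj)
        np.fill_diagonal(pcor, 1.0)            # 1 on the diagonal, as in Eq. 7.3
        return pcor

    def fisher_z_pvalue(r, n):
        """Two-sided p-value for H0: (partial) correlation = 0, via Eq. 7.4."""
        z = 0.5 * np.sqrt(n - 3) * np.log((1 + r) / (1 - r))
        return 2 * stats.norm.sf(abs(z))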


[Figure 7.2: decision procedure for discriminating between the regulatory models of a (cis-QTL S, gene G, metabolite M) triplet. If pcor(S,M) = 0, choose model (1) S → G → M; otherwise, if pcor(G,M) = 0, choose model (5) G ← S → M; otherwise, choose model (3) S → G ← M.]
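Written as code, the decision rule of Figure 7.2 is a small function. The sketch below is hypothetical: the threshold alpha and the p-values from the Fisher-z tests of the partial correlations are assumptions, and treating a non-significant partial correlation as zero is a simplification of the testing described above.

    def orient_triplet(pval_pcor_SM, pval_pcor_GM, alpha=0.05):
        if pval_pcor_SM > alpha:      # pcor(S, M) indistinguishable from zero
            return "model 1: S -> G -> M"
        if pval_pcor_GM > alpha:      # pcor(G, M) indistinguishable from zero
            return "model 5: G <- S -> M"
        return "model 3: S -> G <- M"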


7.3. Results

Figure 7.3.: R^2 for regressing the metabolites on all gene probes, together with all clinical variables (model 1), or after removing the effect of the clinical variables (model 2), showing the top 10 for each model. The results for all metabolites are shown in the insets. Metabolites were sorted in descending order of R^2. R^2 was estimated with nested cross-validation. Note the different scales.

For model 2, the R^2 values were lower, as the effect of the clinical variables has been removed: out of the 136 metabolites, 11 had R^2 ≥ 0.15 and 22 had R^2 ≥ 0.1. Overall, the metabolite R^2 had a Spearman rank correlation of 0.782 between model 1 and model 2. Out of the top 10 metabolites in model 1, 7 were retained in the top 10 metabolites for model 2.

In order to find which variables were selected in each metabolite model, we tabulated the variables (gene expression probes or clinical variables) selected by the lasso model for each metabolite. Since the cross-validation replications are split randomly, a gene may be selected in one replication but not in another. Therefore, we kept only the stable markers — those that were consistently included in the model in ≥ 50% of the cross-validation replications. There were a total of 1137 and 504 unique markers stably selected for models 1 and 2, respectively; however, some probes recurred across many models more than others. For example, HDC (histidine decarboxylase) was the most commonly selected gene, stably selected for 72 out of the 136 metabolites (53%) in model 1, and for 61 (45%) of the metabolites in model 2. For model 1, the most commonly selected marker was waist circumference (Waist circum), selected in 86 (63%) of the metabolites. The top ranked markers are shown in Figure 7.4, using two rankings: (i) the proportion of metabolites for which the marker was selected, and (ii) the proportion weighted by the R^2, in order to upweight markers associated with metabolites that can be better predicted over markers that are common but are associated with metabolites with low R^2.
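A minimal sketch of this tabulation (hypothetical data structures, not the thesis code): markers are counted across cross-validation replications, filtered at the 50% stability threshold, and then ranked across metabolites by selection proportion and by the R^2-weighted proportion used in Figure 7.4.

    from collections import Counter

    def stable_markers(selected_per_replication, threshold=0.5):
        """selected_per_replication: list of sets of marker names, one set per replication."""
        n = len(selected_per_replication)
        counts = Counter(m for sel in selected_per_replication for m in sel)
        return {m for m, c in counts.items() if c / n >= threshold}

    def rank_markers(stable_by_metabolite, r2_by_metabolite):
        """Rank markers by the proportion of metabolite models they appear in,
        and by that proportion weighted by each metabolite's R^2."""
        n_models = len(stable_by_metabolite)
        prop, weighted = Counter(), Counter()
        for metab, markers in stable_by_metabolite.items():
            for m in markers:
                prop[m] += 1.0 / n_models
                weighted[m] += r2_by_metabolite[metab] / n_models
        return prop.most_common(), weighted.most_common()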


Figure 7.4.: The top 10 variables (clinical variables and genes for model 1, genes only for model 2) selected as predictors of metabolite variation in models 1 and 2. The genes were ranked by the proportion of metabolites for which each gene was selected by the lasso regression (a, b), or by the proportion times the R^2 for the corresponding metabolite (c, d), in order to upweight genes that are not only included as predictors of many metabolites but are also more highly predictive. Each inset shows all the variables for each model. Note the different scales.


For both model 1 and model 2, 9 of the 10 top markers ranked by proportion remained in the top 10 when ranked by the weighted score, indicating both that these markers are shared by a substantial proportion of the metabolite models and that these metabolite models are the ones with highest R^2. Between models 1 and 2, 6 and 7 of the top 10 markers were shared in the proportional and weighted rankings, respectively.

The Relative Contribution of Genomic and Clinical Factors to Metabolite Variation

We sought to assess how much of the explained variation in each metabolite was due to clinical factors, such as age, gender, and BMI, and how much was attributable to gene expression, under an assumption of additive effects. If the gene expression can explain further variation in each metabolite, above what was explained by the clinical factors, this indicates that the gene expression is capturing some aspect of the metabolite not explained by traditional clinical indicators.

To assess whether the ability to predict each metabolite was largely attributable to clinical factors or to gene expression, we compared the R^2 for each metabolite between models 1 and 2, ranking them by the ratio of R^2 from model 2 to R^2 from model 1, R^2_{model 2} / R^2_{model 1}, after removing metabolites with zero or negative R^2 in model 1, as these indicate that the model has no explanatory power (we removed 38 such metabolites). Metabolites with ratios > 0.5 indicate that the contribution of gene expression to the predictive ability of the model is higher than the contribution of the clinical variables, and the converse holds for ratios < 0.5. Ratios of zero indicate no gene contribution at all, as all the variance that could be explained has been explained by the clinical variables and adding the gene expression does not improve the model.

Of the remaining 98 metabolite measurements, 60 had non-zero ratios (Figure 7.5), indicating some genomic contribution to the model predictiveness. Of the 60, nine showed R^2 ratios ≥ 0.5: lactate, 3-hydroxybutyrate, acetoacetate, CH2 groups of mobile lipids, total fatty acids (TotFA), mean diameter of very low density lipoprotein particles (VLDL D), cholesterol esters in medium VLDL (M VLDL CE), serum triglycerides (Serum TG), and total cholesterol in medium VLDL (M VLDL C). Lactate was the only metabolite that had a slightly better R^2 in model 2 than in model 1 (0.115 versus 0.107, respectively); however, the difference may be due to randomisation in the cross-validation procedure. Thus, for these nine metabolites, we estimate that the majority of the explained phenotypic variance is attributable to gene expression rather than to clinical factors. While the estimates of R^2 are averaged over multiple nested cross-validation replications and are thus less likely to be spurious than estimates derived without cross-validation, it may be useful to explore whether there is a multiple testing issue due to considering many metabolites.
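A small sketch of this ranking (hypothetical dictionaries mapping metabolite names to cross-validated R^2, not the thesis code), including the truncation of negative R^2 to zero described in the Methods:

    def r2_ratios(r2_model1, r2_model2):
        # Drop metabolites with no explanatory power in model 1, truncate negative
        # model-2 R^2 to zero, and rank by the model-2 to model-1 ratio.
        ratios = {m: max(r2_model2[m], 0.0) / r2
                  for m, r2 in r2_model1.items() if r2 > 0}
        return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)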


Figure 7.5.: Ratio of R^2 in model 2 to R^2 in model 1 for each metabolite, sorted in decreasing order. Large figure: metabolites with ratios ≥ 0.5. Inset: all 98 metabolites with positive R^2 in model 1.


Partitioning the Metabolites into Subtypes

Many of the metabolites analysed here are known to be functionally related to each other and can be partitioned into subtypes. We used hierarchical clustering with complete linkage, and defined eight clusters (Figure 7.6), based on visual inspection of the plot. The clusters showed substantial differences in terms of R^2 for model 2 (Figure 7.7), with clusters 1, 2, and 7 having the highest median R^2. Clusters 1 and 2 consist mainly of measurements related to VLDL, and cluster 7 is a small cluster consisting of citrate, glycerol, 3-hydroxybutyrate, and acetoacetate. In cluster 5, lactate was the only metabolite with positive R^2 (0.115), whereas the remaining members of its group could not be predicted from the genes.

For each metabolic cluster, we tabulated the predictive genes indicated by the lasso model over the cross-validation replications. Genes were selected as stable if they were in the model in at least 60% of the cross-validation replications. The stable genes selected for each metabolite cluster are shown in Table 7.2. HDC (histidine decarboxylase), in clusters 1, 2, and 6, encodes an enzyme involved in the synthesis of histamine, which in turn has wide-ranging effects across the digestive, neural, and immune systems (Schneider et al., 2002). Genetic variation in FCER1A (Fc fragment of IgE, high affinity I, receptor for; alpha polypeptide), from cluster 1, has been shown to be associated with serum immunoglobulin E (IgE), a marker for allergic disorders and parasitic exposure (Weidinger et al., 2008). ABCA1 (ATP-binding cassette, sub-family A (ABC1), member 1), in cluster 1, is known to be responsible for cholesterol transport from peripheral cells back to the liver (Oram and Lawn, 2001), thereby reducing cholesterol levels in the body, and may also contribute to the processes of atherosclerosis, apoptosis, and inflammation (Soumian et al., 2005). MS4A3 (membrane-spanning 4-domains, subfamily A, member 3 (hematopoietic cell-specific)), in clusters 1 and 2, is responsible for cell signalling in the cell cycle mechanism of hematopoietic cells (Kutok et al., 2011). SLC25A20 (solute carrier family 25 (carnitine/acylcarnitine translocase), member 20), in cluster 7, encodes an enzyme involved in transporting long-chain fatty acids into the mitochondria (Iacobazzi et al., 2004). CPT1A (carnitine palmitoyltransferase 1A (liver)), in cluster 7, encodes an enzyme that initiates the oxidation of long-chain fatty acids in the mitochondria (Bonnefont et al., 2004). Currently, there is no known function for SNORD13 (small nucleolar RNA, C/D box 13), from clusters 1 and 2, or for C21orf7 in cluster 1.

7.3.2. Integrating the Metabolite Models with Models of Gene Expression based on SNPs

Having developed predictive models of metabolites from gene expression, we turned to modelling gene expression itself using genetic variation in the form of SNPs, to assess whether any of the genes that were predictive of metabolite levels were themselves under genetic control.

Detecting Genetically Regulated Genes

For this analysis, we selected genes based on their contribution to the metabolite models.
We ranked the metabolite models in descending order of R^2, and used a cutoff of R^2 = 0.1 to select the gene probes belonging to these metabolite models, leaving 44 unique probes, which we subsequently used as candidates in the eQTL analysis.


Figure 7.6.: Hierarchical clustering of the metabolites, using complete linkage.


Figure 7.7.: Box-and-whisker plots of R^2 in model 2 for each metabolite cluster, predicting metabolite concentrations from gene expression.

Cluster   Stable genes
1         SNORD13, HDC, C21orf7, ABCA1, FCER1A, MS4A3
2         HDC, SNORD13, MS4A3
3         -
4         -
5         -
6         HDC
7         SLC25A20, CPT1A
8         -

Table 7.2.: The stable predictive genes selected for each metabolic cluster (appeared in the lasso model in ≥ 60% of the cross-validation replications). "-" indicates that no genes were stably selected in this cluster.


We used lasso linear regression to separately regress each of the selected genes on all SNPs, after having removed the effect of age and gender from the gene expression using unpenalised linear regression, as these factors are known to affect gene expression and may confound the analysis. For each gene, we performed 30 × 10-fold cross-validation, estimated the cross-validated R^2 for each model across a range of model sizes, and selected the model with the best cross-validated R^2.

Figure 7.8 shows the R^2 for predicting the gene expression from the SNPs, for each metabolite where the metabolite-to-gene association was stable, as discussed in Section 7.3.1. Also shown are the R^2 aggregated over all metabolites ("All") and over a set of random genes. For the random genes, we randomly selected 2000 genes not in the metabolic list, and removed genes having correlation R ≥ 0.5 with any gene stably selected for the metabolites, leaving 1771 random genes. The random set of genes was used as a negative control. The R^2 are highly skewed — most genes cannot be predicted from the SNPs; however, a small minority have substantial R^2. Many metabolites are associated with the gene TFG, which is highly predictable from the SNPs (R^2 = 0.741). However, there was no evidence for systematically different R^2 between the two sets of genes, metabolic versus random (p = 0.613 from a two-sided Kolmogorov-Smirnov test that the two R^2 samples come from the same distribution). The four genes for isoleucine had the highest median R^2, with R^2 = 0.495 and R^2 = 0.105 for ANKRA2 and TMEM140, respectively, and R^2 = 0 for the two other genes associated with it (MS4A3 and HDC).

Edge Orientation and Hypotheses of Regulation

For each metabolite, we used the stable predictive genes and their stable cis-QTLs to infer causal networks using partial correlation. The inferred causal network for serum triglycerides (Serum TG) is shown in Figure 7.9. Using three cis-QTLs for TFG (rs13059686, rs591728, rs544500) as causal anchors, we oriented TFG as causal of serum triglyceride levels. In contrast, while ZYG11B is stably associated with serum triglycerides and has several associated QTLs, none of the SNPs were found to be cis-QTLs, and the orientation of the edge between the gene and serum triglycerides remains ambiguous.

7.3.3. Linking the Causal Networks to Fasting Glucose Levels and Type 2 Diabetes

Levels of blood metabolites, particularly of the amino acids isoleucine, leucine, valine, tyrosine, and phenylalanine, have previously been associated with risk of type 2 diabetes (Wang et al., 2011). Our data does not include type 2 diabetes (T2D) status; however, it does include fasting glucose (FG) levels. FG levels in the blood are commonly used to test for T2D status, with FG levels of 3.6–6.0 mmol/L considered healthy (n = 331 in the data), 6.1–6.9 mmol/L indicating pre-diabetes (n = 148), and 7.0 mmol/L and higher indicating T2D (n = 30).


Figure 7.8.: Box-and-whisker plots of cross-validated R^2 for the stable genes associated with each metabolite (predicted from the SNPs), compared with an aggregation of all metabolites ("All") and a random set of genes ("Random"). Also shown is the number of stable genes associated with each metabolite.


Figure 7.9.: Inferred network of regulation for serum triglycerides. Inferred causal edges are shown as solid edges. Dashed edges represent trans-QTLs, where a direct causal effect on the gene cannot be inferred. The edge widths are proportional to the R^2 of the marginal association between the nodes from a univariable linear regression (shown in parentheses).


Glucose was also included as one of the metabolites measured by NMR, and was highly correlated with the clinical FG levels (r = 0.93). However, NMR glucose was not predictable from the genes nor from the SNPs using lasso linear models (negative R^2 in both cases, results not shown). The lack of predictive ability from genes or SNPs also means that we cannot infer causal networks for FG, as we did for the other metabolites. The DILGOM study is likely underpowered to detect signals for fasting glucose, and current GWAS of fasting glucose have sample sizes in the tens of thousands (Dupuis et al., 2010). Instead, we investigated associations between FG and the metabolites, including the amino acids indicated by Wang et al. (2011). We used two lasso-penalised models: a linear model in the log concentration,

    \log FG_i = \sum_{j=1}^{p} x_{ij} \beta_j + \epsilon_i,    (7.6)

where \epsilon_i \sim N(0, \sigma^2) is iid random noise, and a logistic model for FG ≥ 7 mmol/L (linear in the log odds),

    \mathrm{logit}\{\Pr(FG_i \geq 7)\} = \sum_{j=1}^{p} x_{ij} \beta_j,    (7.7)

where logit(r) = log(r/(1 − r)). In both models, we included all metabolites and all clinical variables in order to minimise confounding of the association between the metabolites of interest and fasting glucose levels. For both models, we used lasso penalisation to select the relevant variables, within 50 × 5 × 10-fold nested cross-validation. We considered metabolites that were selected with non-zero weights in more than 70% of the replications as "stable". Note that there is no intercept in the models, as the metabolite concentrations have been regressed on the clinical variables together with an intercept term.

All inputs were scaled to zero mean and unit variance; therefore, the regression weights should be interpreted in units of standard deviations of each metabolite. For example, a weight of exp(β) = 1.017 in the linear model (7.6) means that a one standard deviation increase in the metabolite is associated with a multiplicative increase by a factor of 1.017 in fasting glucose, such as the increase from 6.0 mmol/L to 6.102 mmol/L. Weights less than one should be interpreted as negative associations: an increase in metabolite levels decreases fasting glucose levels in the linear model (Eqn. 7.6). Similarly, for the logistic model, the weight exp(β) corresponds to the increase in the odds ratio for each standard deviation increase in the input variable.

Figure 7.10 shows the metabolites stably associated with fasting glucose, in the linear and logistic models, correcting for the effects of clinical variables such as age, gender, insulin levels (known to be highly associated with fasting glucose levels), cholesterol-lowering medication, and blood pressure. The appearance of amino acids such as isoleucine, tyrosine, valine, and alanine in the models of FG is consistent with the findings of Wang et al. (2011) associating these amino acids with increased T2D risk.
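As a rough illustration (not the analysis pipeline used here, which involved repeated nested cross-validation and stability filtering), the two models of Eqns. 7.6 and 7.7 can be fitted with standard penalised-regression software; the sketch below assumes scikit-learn, a standardised input matrix X with the clinical effects already removed, and a fasting-glucose vector fg in mmol/L.

    import numpy as np
    from sklearn.linear_model import LassoCV, LogisticRegressionCV

    def fit_fg_models(X, fg, threshold=7.0):
        # Linear lasso model of log fasting glucose (Eqn. 7.6)
        linear = LassoCV(cv=10, fit_intercept=False).fit(X, np.log(fg))
        # Logistic lasso model of FG >= 7 mmol/L (Eqn. 7.7)
        logistic = LogisticRegressionCV(cv=10, penalty="l1", solver="liblinear",
                                        fit_intercept=False).fit(X, fg >= threshold)
        # exp(beta): multiplicative change in FG (linear model) or odds ratio
        # (logistic model) per one standard deviation increase in each input
        return np.exp(linear.coef_), np.exp(logistic.coef_.ravel())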


Figure 7.10.: Metabolites selected as stably associated with fasting glucose by the lasso regression, correcting for the effect of the clinical variables. Panel (a): linear model of log FG; panel (b): logistic model of FG ≥ 7 mmol/L. The edge weights show the exponentiated weights exp(β), corresponding to increases in (a) fasting glucose and (b) the odds ratio of fasting glucose ≥ 7 mmol/L, respectively, for a one standard deviation increase in each metabolite, averaged over the cross-validation replications.


The metabolites selected in the two models largely differ, with the exception of lactate, which appears in both. We hypothesise that this difference may be due to the fact that the linear model attempts to model variation across all levels of fasting glucose, whereas the logistic model only attempts to discriminate low from high levels. There may be non-linear regulatory effects that occur when fasting glucose increases from medium to high levels, but that do not manifest when fasting glucose levels are low. This hypothesis is supported by the findings of a large GWAS of T2D risk and fasting glucose by Dupuis et al. (2010), who report that not all of the SNPs associated with fasting glucose across the physiological range were also associated with pathological levels of fasting glucose and high T2D risk, potentially indicating that fasting glucose levels themselves are not necessarily indicative of high risk, but rather that the mechanisms by which fasting glucose levels are raised may differ between the two regimes. Since we are investigating associations of metabolites with fasting glucose, we cannot orient the edges between the metabolites and fasting glucose, as the cis-QTLs can no longer be used as causal anchors here — we do not have a known direct causal driver of either the metabolites or of fasting glucose.

For some of the metabolites we could infer the causal networks that affect them, based on finding cis-eQTLs as causal anchors for the genes. Figure 7.11 shows the inferred causal networks for the metabolites lactate, triglycerides in medium VLDL (M VLDL TG), and isoleucine. For lactate, only one stable and causal probe was found (ILMN 1867138, from the RST13952 Athersys RAGE Library), mapping to a region on chr4 with currently no known gene annotated, and a corresponding cis-QTL rs6852748. For M VLDL TG, causal anchor cis-QTLs were found for all four genes. PSMD2 (proteasome (prosome, macropain) 26S subunit, non-ATPase, 2) encodes a subunit of an enzyme (the proteasome) which is involved in the production of major histocompatibility complex class 1 (MHC-1) peptides (Kloetzel, 2004), an important step in the cellular immune response. TFG (TRK-fused gene) is involved in the regulation of protein export from the endoplasmic reticulum and has a role in oncogenesis (Hernández et al., 1999; Miranda et al., 2006; Pagant and Miller, 2011). CCS (copper chaperone for superoxide dismutase) delivers copper to the enzyme copper/zinc superoxide dismutase; copper deficiency has been associated with increased risk of cardiovascular disease and diabetes, among others, potentially through the effects of increased oxidative stress (Uriu-Adams and Keen, 2005).
DGKQ (diacylglycerol kinase, theta 110kDa) encodes an intracellular lipid kinase responsible for regulating levels of diacylglycerol in the cell membrane, and its products are involved in the control of diverse processes such as lipid metabolism, cell growth, membrane trafficking, cell differentiation, and cell migration (Mérida et al., 2008). ANKRA2 (ankyrin repeat, family A (RFXANK-like), 2) physically binds to class II histone deacetylases, enzymes that reduce gene transcription by deacetylating the amino-terminal tails of histones, and may be involved in signalling pathways controlling antigen presentation (McKinsey et al., 2006).


Metabolite    Gene      SNP          Chr    Nearby gene (within 1Mb)
M VLDL TG     PSMD2     rs288723     13     -
              PSMD2     rs3805036    3      ITPR1
              DGKQ      rs225320     21     TMPRSS3
              DGKQ      rs2249431    20     SIRPD
              CCS       rs3817625    14     TRA/TCR/TCRVA15/TCRA
              TFG       rs660657     11     FLI1
              TFG       rs4273418    4      -
              TFG       rs2419324    13     ATP8A2
              TFG       rs2046227    3      GPR128
              TFG       rs1823870    2      LTBP1
              TFG       rs16875290   7      GARS
              TFG       rs1485003    7      PDK4
              TFG       rs12495023   3      ST6GAL1
Isoleucine    TMEM140   rs10403127   19     MED25
              TMEM140   rs10899261   11     GUCY2E
              TMEM140   rs1483179    8      NKAIN3
              TMEM140   rs336384     4      INPP4B
              TMEM140   rs6836941    4      KIAA0232
              TMEM140   rs9561879    13     CLDN10
              ANKRA2    rs1009697    13     DAOA
              ANKRA2    rs17005004   4      C4orf22

Table 7.3.: trans-QTLs for the genes associated with the metabolites predictive of fasting glucose levels.

Little is known about the role of TMEM140 (transmembrane protein 140); it is hypothesised to be involved in hematopoiesis (Shimizu et al., 2008), and was recently found to be moderately differentially expressed in a case/control study of preeclamptic pregnancies (Løset et al., 2011).

The trans-QTL SNPs shown are associated with the genes to varying degrees; however, we cannot infer whether they are direct regulators of these genes or whether they are instead mediated by other genes or metabolites. The trans-QTL SNPs for each gene are shown in Table 7.3. The most strongly associated trans-QTL for TFG, rs2046227, resides in proximity to GPR128 (G-protein coupled receptor 128). GPR128 and TFG have been shown to create a fusion transcript, especially in atypical myeloproliferative neoplasms but also, less commonly, in healthy individuals (Chase et al., 2010).

Population Structure

Although the DILGOM dataset was randomly sampled from unrelated individuals, hidden population structure may still induce spurious associations in the data, confounding our analysis.


Figure 7.11.: Inferred causal networks for three metabolites stably associated with fasting glucose levels: (a) lactate, (b) M VLDL TG, (c) isoleucine. The edge weights are the R^2 from a univariable linear regression of each child node on each parent node. The R^2 from a multivariable lasso linear regression on all inputs (SNPs for genes and genes for metabolites) is shown in parentheses next to each node.


Figure 7.12.: Top 5 principal components from PCA of the genotype data.


Probe           Gene      λ
ILMN 2341815    TFG       1.0078
ILMN 1656676    ZYG11B    1.0006
ILMN 1712432    PSMD2     1.0074
ILMN 1793017    DGKQ      1.0058
ILMN 1766797    CCS       1.0098
ILMN 1867138    -         1.0102
ILMN 1687351    ANKRA2    1.0030
ILMN 1736863    TMEM140   1.0051

Table 7.4.: Genomic inflation factors for the genes associated with metabolites predictive of fasting glucose, based on the median χ^2 statistics from the linear model of association in PLINK.

We used smartpca from Eigensoft 4.0beta (Price et al., 2006) to assess stratification in the genotype data (Figure 7.12). There was no evidence for substantial substructure in the genetic data. We also estimated genomic inflation factors λ for the genes implicated in the networks for lactate, M VLDL TG, and isoleucine. We used the median χ^2 statistic from the linear regression test in PLINK (Purcell et al., 2007), using all SNPs in the data, to check for possible hidden population structure in the genotype data (Table 7.4). There was no evidence for substantial inflation of the test statistics for the genes analysed.

7.4. Discussion

In this chapter we have analysed the DILGOM dataset, composed of gene expression, SNP, and metabolite data, assayed from a random sample of a Finnish population. Our aims were to estimate how much of the variation in metabolite levels could be attributed to gene expression after accounting for known clinical factors, to assess how many of the predictive genes could themselves be predicted from SNPs, and to infer causal pathways regulating these metabolites.

This work expands on previous analyses of the same data (Inouye et al., 2010a,b), which examined the lipid-leukocyte (LL) gene module and inferred causal networks of interaction with other genes and metabolites. Using lasso-penalised linear models, we have uncovered genome-wide expression associated with metabolite levels, and for these predictive genes we then discovered highly associated SNPs. Using cis-QTLs as causal anchors for a partial correlation analysis, we oriented the edges of the SNP-gene-metabolite networks, inferring causal pathways of genetic regulation of gene expression associated with serum metabolite levels. Three metabolites — lactate, isoleucine, and triglycerides in medium VLDL (M VLDL TG) — were stably associated with fasting glucose levels, a commonly-used clinical marker for type 2 diabetes.


These results suggest potential causal mechanisms that contribute to the genetic basis of type 2 diabetes, mediated by gene expression and metabolites. Specifically, we inferred causal networks for isoleucine, known to be associated with future type 2 diabetes status (Wang et al., 2011). We reproduced the association of isoleucine with fasting glucose in our data. The causal network for isoleucine included two of the genes associated with it (ANKRA2 and TMEM140) that were themselves shown to be under strong genetic control by cis-QTLs. This causal network represents a novel hypothesis of the gene pathways mediating the observed effect of SNPs on fasting glucose levels and on risk of type 2 diabetes, and further elucidates the genetic basis of type 2 diabetes.

In addition, this work demonstrates the increased insight gained by employing multiple data types, each examining a different aspect of the samples. While associations between genes and metabolites were informative, only through the use of SNP data, and of cis-QTLs specifically, were we able to orient the causal edges of the graph and to generate plausible hypotheses of genetic regulation mediated by gene expression. Besides the cis-QTLs, the trans-QTLs provide additional information about the regulatory pathways, such as the association between genetic variation near GPR128 and the gene expression of TFG, which we have predicted to be a causal factor of the levels of triglycerides in medium VLDL.

The integrative analysis presented in this chapter was possible due to scalable methods for fitting lasso-penalised linear models to genome-wide data. This work has demonstrated both the practical utility and the feasibility of such models for the detection of gene-metabolite and SNP-gene associations. In addition, the use of multi-omic datasets (genes, SNPs, and metabolites) makes it possible to generate more detailed hypotheses of biological mechanisms of cellular regulation, potentially leading to a better understanding of the causal structure of disease development. Integration of yet more data types, such as epigenetic markers or structural data such as copy number variation, will likely lead to an even better understanding of disease etiology.

Limitations

This work has several limitations. First, we have used gene expression from lymphocytes in whole blood, which is likely different from the expression of the same genes in the tissues responsible for metabolism, for example the liver; however, such samples would be difficult to obtain due to the technical and ethical challenges posed by taking biopsies from healthy individuals. This means that important associations may not have been found.
Second, we only employed linear models and did not model epistatic interactions between genes and genes, genes and metabolites, SNPs and SNPs, or combinations of these factors. Interactions cannot be ruled out, although Surakka et al. (2011) found only one such interaction, between waist-hip ratio and rs6448771, with total cholesterol (TC) as the phenotype, which was statistically significant but explained less than 0.5% of the phenotypic variance of TC.


Third, we did not attempt to infer a causal graph of all genes related to a given metabolite, but rather only considered each gene separately. Fourth, we focused on the genes that were predictable from SNPs; we did not examine genes that were highly predictive of metabolite levels but could not be predicted from SNPs. Nor did we examine all SNPs potentially associated with metabolites, but rather filtered the SNPs based on association with the predictive genes. In other words, there may yet be SNPs predictive of metabolite levels that are not associated with the gene expression assayed in these data. Fifth, further external validation of our findings in independent multi-omic datasets is required in order to confirm their robustness and to allay concerns of confounding due to unknown factors. Sixth, larger sample sizes are likely to be needed to achieve enough statistical power to detect weaker associations that could be present in this dataset but were undetected in this analysis. Seventh, moving beyond linear models to more flexible models such as trees (Random Forests or boosting) would potentially allow detection of non-linear effects, such as gene-gene interactions, that could further enhance prediction ability and may provide a more realistic description of the underlying complex biology. Finally, other approaches for causal structure inference (Aten et al., 2008) could potentially allow us to infer more complicated causal structures than we have here, including those involving multiple genes interacting with each other (a global directed network). Inference of the entire network of SNPs, genes, and metabolites would potentially provide a better elucidation of the causal mechanisms underlying genetic regulation of metabolism and associated phenotypes such as type 2 diabetes.


8. Fused Multitask Penalised Regression

In Chapter 7 we analysed a multi-omic dataset consisting of SNPs, gene expression, and metabolites. We used multiple inputs for each model (multiple SNPs per gene, multiple genes per metabolite), but we only used one output: each phenotype was considered independently. However, many phenotypes, for example metabolite levels and gene expression levels, are clearly not independent: there exist strong correlations within the phenotypes. These correlations could potentially be used to build better predictive models. In this chapter we propose fmpr, a statistical method for taking advantage of the inter-dependencies between the multiple phenotypes, or tasks, for use in gene expression or genetic datasets. In simulation, we show that our approach induces better predictive models than other approaches that do not consider the task relatedness, such as the lasso, and in some cases better than the graph-guided fused lasso, which accounts for task relatedness. In an analysis of the DILGOM dataset, fmpr achieves better predictive ability than the lasso.

8.1. Introduction

High-throughput technologies, such as gene expression microarrays and single nucleotide polymorphism (SNP) microarrays, have made it possible to assay thousands and even millions of potential biological markers, in order to detect associations with clinical phenotypes such as disease. At the same time, the definition of what is a phenotype has become more encompassing. Rather than considering only macro-level clinical phenotypes such as the presence of disease, other lower-level (intermediate) phenotypes, such as gene expression and metabolite levels, are becoming of interest.


levels, are becoming of interest. For example, it is now routine to scan SNPs for potential expression quantitative trait loci (eQTL), regulating the expression of genes (Mackay et al., 2009), or the levels of serum metabolites (Inouye et al., 2010a,b; Tukiainen et al., 2011). With these analyses there can be hundreds to thousands of phenotypes rather than the single binary status phenotype commonly found in case/control studies.

Figure 8.1.: An illustration of a hypothetical setup in which five genes G_1, ..., G_5 are associated with five metabolites M_1, ..., M_5. Several metabolites share the same gene associations (solid lines), and are therefore correlated with each other (correlation shown by dashed lines). By leveraging the inter-metabolite correlations, multitask methods such as fmpr and GFlasso aim to better identify which inputs (genes in this case) are truly associated with which outputs (metabolites), while avoiding spurious associations due to effects such as noise, under the assumption that correlated outputs are caused by common regulators (pleiotropic genes).

The simplest approach to analysing these multiple phenotypes is to consider each one separately, essentially ignoring their inter-dependencies. However, many of these phenotypes are related to each other, and information may be gained from these relationships. For example, the expression of some genes can be highly correlated if they are regulated by the same transcription factor (TF), and a genetic variant at that TF will potentially influence all of them (so-called pleiotropy). Therefore, by considering all phenotypes concurrently, and explicitly accounting for the correlations between them, there is potential for better statistical models, and for better understanding of the underlying biological processes that may be common to these phenotypes.
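To make the single-task baseline concrete, the following sketch fits one lasso model per phenotype with glmnet (the lasso implementation used later in this chapter); it is only an illustration of the "K separate regressions" approach, and the names X (an N × p input matrix) and Y (an N × K phenotype matrix) are assumptions made for the example.

    ## A minimal sketch of the single-task baseline: each of the K phenotypes is
    ## modelled by its own lasso fit via glmnet, ignoring the correlations
    ## between phenotypes. X (N x p) and Y (N x K) are assumed names.
    library(glmnet)

    single_task_lasso <- function(X, Y) {
      lapply(seq_len(ncol(Y)), function(k) {
        cv <- cv.glmnet(X, Y[, k], alpha = 1)   # per-task cross-validated lasso
        coef(cv, s = "lambda.min")              # weights at the selected penalty
      })
    }

Nothing in such a fit links the weights for one task to those of any other task, which is precisely the limitation that the multi-task approaches discussed next aim to address.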


Recently, Kim and Xing (2009) proposed the graph-guided fused lasso (GFlasso), a statistically principled approach for analysing datasets with multiple related phenotypes, under the assumption that correlated phenotypes are driven by similar underlying causal mechanisms, such as common SNPs affecting the expression of several genes or common pleiotropic genes affecting metabolite levels (Figure 8.1). GFlasso performs variable selection through the lasso penalty on the model fit for a given phenotype, influenced by the weights of the other phenotypes correlated with this phenotype. The differences between the weights for the same marker in the different phenotypes (tasks) are themselves penalised using a lasso-type fusion penalty, which tends to induce models where the same input (SNP in this case) is selected to be in the model across all phenotypes, while allowing the weights to vary between phenotypes.

We propose an alternative approach, termed Fused Multitask Penalised Regression (fmpr), that essentially replaces the fused lasso term with a fused ridge term, and present an efficient algorithm based on coordinate descent to optimise the fmpr loss function. In this chapter, we investigate the performance of fmpr compared with GFlasso and several other methods, on simulated data and on the DILGOM data.

8.2. Background

The lasso is a useful method for variable selection and shrinkage in the single-task setting, defined as the solution to the l1-penalised least squares problem

    \beta^* = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,    (8.1)

where y_i ∈ R and x_i ∈ R^p are the ith output and input, respectively, N is the number of samples, and β_j ∈ R is the jth model weight (regression coefficient). In this formulation and all following loss functions, we assume that the inputs x and outputs y are standardised to zero mean and unit variance, so we do not include an intercept term in the model. An alternative approach to the lasso is ridge regression (l2 penalty),

    \beta^* = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.    (8.2)

It is well known that the lasso penalisation induces sparse models, in which many weights are exactly zero, whereas the ridge penalisation does not, in general (Hastie et al., 2009a; Tibshirani, 1996). The lasso performs both variable selection, setting the irrelevant variables to zero, and shrinkage (reduction of the weights towards zero), whereas ridge regression only tends to shrink the weights. A penalisation scheme that combines the two penalties


is the naïve elastic net (Zou and Hastie, 2005),

    \beta^* = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 + \alpha\lambda \sum_{j=1}^{p} |\beta_j| + (1 - \alpha)\lambda \sum_{j=1}^{p} \beta_j^2.    (8.3)

Through the tuning parameter α ∈ [0, 1], a compromise between ridge regression and the lasso can be achieved, and the number of variables that can enter the model can be larger than the number of observations (unlike the lasso).

The lasso, ridge regression, and elastic net are single-task methods, in that they consider only one output vector y at a time. Multiple tasks can be modelled by fitting a separate model to each task; however, none of these methods takes into account any information from the other tasks when fitting a model for a given task.

The graph-guided fused lasso (GFlasso) (Kim and Xing, 2009) is an extension of the lasso, applied to multiple tasks concurrently, which employs a fusion penalty to selectively merge together the weights of K related outputs. The motivation is to borrow power across multiple outputs, such that similar outputs (phenotypes) will tend to have similar inputs (such as SNPs) as non-zero, thus tending to select the same inputs across all outputs. The GFlasso for linear loss is formulated as

    B^* = \arg\min_{B \in \mathbb{R}^{p \times K}} \frac{1}{2} \sum_{k=1}^{K} \lVert y_k - X\beta_k \rVert_2^2 + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\beta_{jk}| + \gamma \sum_{(m,l) \in E} f(r_{ml}) \sum_{j=1}^{p} |\beta_{jm} - \mathrm{sign}(r_{ml}) \beta_{jl}|,    (8.4)

where ||z||²₂ = Σ_{i=1}^N z_i² is the squared l2-norm, y_k = (y_1k, ..., y_Nk)^T is the output vector for the kth task, X is the N × p matrix of inputs with rows x_i^T, B = [β_1, ..., β_K] ∈ R^{p×K} is the matrix of weights for all tasks, f(r_ml) is a function monotonic in the absolute value of the Pearson correlation r_ml between the mth and lth phenotypes, and E is the set of inter-task edges. The edges can be induced by thresholding the Pearson correlation r_ml, or they can simply be the set of all edges. The set of edges E is assumed to be identical for all p variables, meaning that the degree of task relatedness does not vary across the variables.

8.3. Methods

Our aim is to develop a method for modelling multiple related outputs in a computationally efficient and thus scalable way.

8.3.1. Fused Multitask Penalised Regression

Unlike GFlasso, which uses an l1 fusion penalty, fmpr uses an l2 fusion penalty to shrink the differences between the weights of related tasks. As discussed below, the change in penalisation has implications


both in terms of recovery of the true non-zero inputs, and in terms of optimising the loss function. We formulate the fused ridge penalised squared loss as

    B^* = \arg\min_{B \in \mathbb{R}^{p \times K}} \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - x_i^T \beta_k)^2 + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\beta_{jk}| + \frac{\gamma}{2} \sum_{m=1}^{K} \sum_{l=1}^{K} f(r_{ml}) \sum_{j=1}^{p} [\beta_{jm} - \mathrm{sign}(r_{ml}) \beta_{jl}]^2.    (8.5)

This problem, like the GFlasso, is a convex optimisation problem (see, for example, Boyd and Vandenberghe, 2004). The λ penalty tunes sparsity within each task (lasso). The γ penalty shrinks the differences between weights for related tasks towards zero but, unlike the GFlasso, does not necessarily encourage sparsity in the differences between the weights for related tasks. The effect of the fusion penalty can be seen in Figure 8.2, where the weights for a single parameter β_j over K = 10 tasks are shown for increasing values of γ. The fusion penalty smoothly encourages the weights across the tasks to become more similar to each other, until they are identical. The fmpr loss is equivalent to the lasso when γ = 0.

Figure 8.2.: The solution path of fmpr for one parameter β_j over K = 10 tasks, for increasing γ and with λ = 0.
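As an illustration of Eqn. 8.5, the sketch below (not the fmpr implementation itself; it assumes standardised X and Y and uses f(r) = r²) computes the fused ridge penalised loss for a given weight matrix B.

    ## A minimal sketch of the fused ridge penalised loss in Eqn 8.5.
    ## X is N x p, Y is N x K, B is p x K; lambda and gamma are the penalties.
    fmpr_loss <- function(X, Y, B, lambda, gamma, f = function(r) r^2) {
      R <- cor(Y)                              # inter-task Pearson correlations r_ml
      K <- ncol(Y)
      sq.loss <- 0.5 * sum((Y - X %*% B)^2)    # squared-error term over all tasks
      l1.pen  <- lambda * sum(abs(B))          # within-task lasso penalty
      fusion  <- 0
      for (m in 1:K) {
        for (l in 1:K) {
          d <- B[, m] - sign(R[m, l]) * B[, l] # signed differences between task weights
          fusion <- fusion + f(R[m, l]) * sum(d^2)
        }
      }
      sq.loss + l1.pen + 0.5 * gamma * fusion
    }

Setting gamma to zero recovers the ordinary lasso objective for each task, as noted above.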


While the fused ridge approach has been presented here for linear loss, the same method can be applied to other loss functions such as the logistic and squared hinge losses for classification.

A crucial factor in multi-task methods such as fmpr and GFlasso is the definition of task relatedness, which is the basis for the fusion penalty. The simplest approach is to threshold the Pearson correlation in absolute value, thus only considering as edges correlations with high enough magnitude. However, thresholding has two major disadvantages that greatly limit its usefulness in practice. First, the threshold is arbitrary, and must be manually tuned for each set of inputs. Second, thresholding is inherently a binary operation, and any pair of outputs with correlation slightly below the cutoff will be considered completely unrelated. Determining the correct threshold is especially problematic when the data are noisy, as is typically the case in many genomic experiments, in which case the correlation between outputs will be subject to random fluctuations as well. An alternative approach to thresholding is weighting, where the magnitude of the correlation defines task relatedness in a continuous fashion, without the need for an arbitrary cutoff. Useful weighting functions explored by Kim and Xing (2009) include f(r) = r² and f(r) = |r|, and we will use these as well. The weighting functions are monotonic in |r|, so that negative correlation has the same magnitude of effect as positive correlation, except that the weights are encouraged to be dissimilar rather than similar.

8.3.2. Implementation

We use cyclical coordinate descent (Friedman et al., 2010) to minimise the fused ridge loss, as outlined in Algorithm 2. Coordinate descent iterates over all variables, one at a time, performing a univariable Newton step with respect to the variable being updated. It is especially efficient for fitting sparse models, since variables that are not in the model do not need to be visited in each iteration. Like the lasso and ridge regression, the graph-guided fused lasso and the fused ridge are convex problems, and can be solved using standard tools from convex optimisation. However, unlike the lasso or the fused ridge, which can be solved efficiently with coordinate descent, the lasso fusion loss cannot be minimised by coordinate descent in its original form, as the minimisation process may result in suboptimal solutions, getting stuck in non-smooth corners of the loss function (Friedman et al., 2007), due to the non-separability of the penalty (Tseng, 2001). Consequently, other methods have been proposed, such as the smoothing proximal gradient method (Chen et al., 2012) or reweighted-l2 methods (Bach et al., 2011; Kim and Xing, 2009).

In the coordinate descent procedure, we iterate over each variable j = 1, ..., p in each task k = 1, ..., K, taking a Newton step

    s_{jk} = \beta_{jk} - \frac{\partial L / \partial \beta_{jk}}{\partial^2 L / \partial \beta_{jk}^2},    (8.6)


where L is the loss function. For linear loss (least squares regression), the first partial derivatives are

    \frac{\partial L}{\partial \beta_{jk}} = \sum_{i=1}^{N} x_{ij} (x_i^T \beta_k - y_{ik})    (8.7)

and the second partial derivatives are

    \frac{\partial^2 L}{\partial \beta_{jk}^2} = \sum_{i=1}^{N} x_{ij}^2.    (8.8)

The partial derivatives of the fusion penalty Ω are

    \frac{\partial \Omega}{\partial \beta_{jk}} = \sum_{l=1}^{K} f(r_{kl}) (\beta_{jk} - \mathrm{sign}(r_{kl}) \beta_{jl})    (8.9)

and

    \frac{\partial^2 \Omega}{\partial \beta_{jk}^2} = \sum_{l=1}^{K} f(r_{kl}).    (8.10)

In a fashion similar to the implementation of the elastic net method (Zou and Hastie, 2005) (Eqn. 8.3), fmpr implements the combined penalties by first computing the Newton step, then applying the fusion penalty to arrive at a penalised Newton step, and finally soft-thresholding the penalised Newton step to achieve sparsity through the lasso penalty.

8.3.3. Computational Enhancements

We employ active-set convergence (Friedman et al., 2010) to speed up convergence. Briefly, we begin by iterating over all variables. Any variable that becomes zero is declared inactive and removed from the active set. Once all variables have converged, as determined by absolute convergence of the loss, we iterate over all variables again. If the active set remains the same, then the algorithm terminates. Otherwise, all variables are added back to the active set and we iterate over them as before.

In addition, we organise the computation such that for each γ, fmpr computes the solutions over a grid of increasing λ penalties. The smallest λ is user defined, but the largest λ is defined as the penalty making all model weights zero. Thus, if at any stage along the path all weights become zero, we can terminate the path early, since increasing λ further will necessarily result in all weights staying zero, and the weights for these models do not need to be computed explicitly.


    while not converged do
        for k = 1, ..., K do
            for j = 1, ..., p do
                // derivatives
                d_1 ← ∂L/∂β_jk
                d_2 ← ∂²L/∂β²_jk
                // inter-task fusion penalty
                d_1 ← d_1 + γ Σ_{l=1}^K f(r_kl) (β_jk − sign(r_kl) β_jl)
                d_2 ← d_2 + γ Σ_{l=1}^K f(r_kl)
                // Newton step
                s_jk ← β_jk − d_1 / d_2
                // lasso soft-thresholding
                β_jk ← 0 if |s_jk| ≤ λ, otherwise s_jk − λ sign(s_jk)
            end
        end
    end

Algorithm 2: An outline of the coordinate descent algorithm for minimising the fused ridge penalised loss function. We assume that each input x_j and output y_k is standardised to zero mean and unit variance; f(r_kl) is a monotonic function in the absolute value of the correlation r, such as |r| or r².
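For concreteness, the sketch below implements one sweep of Algorithm 2 in R; it is a simplified, unoptimised illustration (not the fmpr package itself), assuming standardised X and Y, a precomputed task correlation matrix R, and f(r) = r².

    ## A minimal sketch of one coordinate descent sweep (Algorithm 2, Eqns 8.6-8.10).
    fmpr_sweep <- function(X, Y, B, lambda, gamma, R, f = function(r) r^2) {
      p <- ncol(X); K <- ncol(Y)
      for (k in 1:K) {
        for (j in 1:p) {
          res <- X %*% B[, k] - Y[, k]                  # current residual for task k
          d1 <- sum(X[, j] * res)                       # Eqn 8.7
          d2 <- sum(X[, j]^2)                           # Eqn 8.8
          w  <- f(R[k, ])                               # fusion weights f(r_kl)
          d1 <- d1 + gamma * sum(w * (B[j, k] - sign(R[k, ]) * B[j, ]))  # Eqn 8.9
          d2 <- d2 + gamma * sum(w)                     # Eqn 8.10
          s  <- B[j, k] - d1 / d2                       # Newton step, Eqn 8.6
          B[j, k] <- if (abs(s) <= lambda) 0 else s - lambda * sign(s)   # soft threshold
        }
      }
      B
    }

In practice the sweep is repeated until the loss converges, with the active-set and early-termination strategies of Section 8.3.3 layered on top.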


8.4. Simulation

We compared the following penalised regression approaches:

• The fused ridge method fmpr, weighted by the correlation of the outputs (fmpr-w1 for |r| and fmpr-w2 for r²);
• GFlasso; we used a C implementation of the MATLAB tool spg (Chen et al., 2012)¹, weighted by the output correlation (GFlasso-w1 for |r| and GFlasso-w2 for r²);
• Ridge regression;
• Lasso, using glmnet (Friedman et al., 2010);
• Naïve elastic net, using glmnet.

¹http://www.cs.cmu.edu/~xichen/Code/SPG_Multi_Graph.zip

The first two methods take into account the task relatedness. The remaining three do not, in that the weights of the K tasks are completely unrelated to each other; this is equivalent to K separate regressions. We did not include thresholded versions of fmpr and GFlasso, where the graph is cut based on a correlation threshold, since the threshold is set arbitrarily and may result in meaningless graphs, especially over cross-validation replications where the correlation between phenotypes varies randomly. Kim and Xing (2009) found the weighted versions of GFlasso to consistently outperform the thresholded version in simulation and on real data.

We evaluated the models in terms of recovery of the causal input variables, and in terms of predictive ability:

• Recovery is binary classification of whether the true weights are non-zero or zero, and is measured using precision/recall (PRC) and ROC curves. We used 5-fold cross-validation to find the optimal hyperparameters, which were then used to fit the model to the entire dataset. The absolute values of the estimated model weights |β| are compared against the zero pattern of the true weights, with non-zero weights taken as the positive class. We used the R package ROCR (Sing et al., 2005) to estimate these curves.
• Predictive ability is measured as R² in cross-validation. For all methods, we used nested cross-validation, where the data are split 90%/10%. The 90% portion is used for 5-fold cross-validation to find the optimal hyperparameters. We then fit the model using the optimal hyperparameters to the entire 90% subset, and test it on the previously unseen 10%. Nested cross-validation eliminates the optimisation bias inherent in using standard cross-validation to both optimise hyperparameters and evaluate predictive performance (Ambroise and McLachlan, 2002).

We considered the following classes of experimental setups with varying parameters:

1. The same sparsity pattern and same weights across all tasks


a) Differing sample sizes N = 50, 100, 200, with p = 100, K = 10, σ = 1, and β = 0.1.
b) Differing noise levels σ = 0.5, 1, 2, with N = 100, p = 100, K = 10, and β = 0.1.
c) Differing numbers of tasks K = 5, 10, 20, with N = 100, p = 100, β = 0.1, and σ = 1.
d) Differing weights of the causal variables β = 0.05, 0.1, 0.5, with N = 100, p = 100, K = 10, and σ = 1.
e) Differing numbers of variables p = 50, 100, 500, 1000, with N = 100, K = 10, σ = 1, and β = 0.1.

2. Same sparsity pattern with different weights β_j ∼ N(μ_β, σ_β)

a) β_j ∼ N(0.5, 0.05), N = 100, p = 100, K = 10, and σ = 1.
b) β_j ∼ N(0.5, 0.5), N = 100, p = 100, K = 10, and σ = 1.
c) β_j ∼ N(0.5, 2), N = 100, p = 100, K = 10, and σ = 1.

3. Different sparsity pattern with different weights (completely unrelated tasks)

a) β_jk ∼ N(0, 1), N = 100, p = 100, K = 10, and σ = 1.

4. Same sparsity and same-magnitude weights, but weight signs flipped in some tasks (mixed positive/negative correlation)

a) β ∈ {−0.1, 0.1}, N = 100, p = 100, K = 10, and σ = 1.

The reference setup for all experiments with the same sparsity and same weights was N = 100, p = 100, K = 10, β = 0.1, and σ = 1.

An illustration of the weight matrix B for the three sparsity setups is shown in Figure 8.3. The same-sparsity same-weights setup appears as bands of uniform colour across all K tasks. The same-sparsity different-weights setup shows bands with differing weights across the tasks. The unrelated setup is random: each task has its own independent set of weights. The induced inter-task correlations are shown for each setup pattern. For same-sparsity same-weights, all correlations are high. For the same-sparsity different-weights setup most correlations are high, with some exceptions. For the unrelated setup, correlations are substantially lower. Note that these correlations are strongly dependent on the level of noise: the higher the noise level, the lower the inter-task correlations will become, regardless of the sparsity pattern.

All the input variables x were generated as iid Gaussian random variables. The outputs y were simulated as

    y_{ik} = x_i^T \beta_k + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, ..., N, \; k = 1, ..., K.    (8.11)

We computed ROC and PRC curves using the absolute weights |β̂| of each model compared with the binary sparsity pattern of the true weights I(β ≠ 0). The results are threshold-averaged (Fawcett, 2006) over 50 independent replications to produce average curves.
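As an illustration of this simulation design (Eqn. 8.11), the sketch below generates one replicate of the same-sparsity, same-weights setting; the function name and the drawing of independent noise for every task are assumptions made for the example.

    ## A minimal sketch of the simulation in Eqn 8.11 (same sparsity, same weights).
    simulate_tasks <- function(N = 100, p = 100, K = 10, beta = 0.1,
                               sigma = 1, prop.zero = 0.8) {
      X <- matrix(rnorm(N * p), N, p)               # iid Gaussian inputs
      nonzero <- runif(p) > prop.zero               # ~20% of variables causal, on average
      B <- matrix(0, p, K)
      B[nonzero, ] <- beta                          # same weights across all K tasks
      E <- matrix(rnorm(N * K, sd = sigma), N, K)   # Gaussian noise
      Y <- X %*% B + E
      list(X = X, Y = Y, B = B)
    }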


Figure 8.3.: An illustration of the three sparsity setups used in the multi-task simulations. Top row: absolute values of the p × K weight matrix B used for generating the outputs y, for models 1, 2, and 3, respectively (model 4 has identical weights and correlations in absolute value to model 1). Bottom row: the K × K correlation matrices of the outputs y.


8. Fused Multitask Penalised Regressionaveraged (Fawcett, 2006) over 50 independent replications to produce average curves. Forexperiments based on using the same sparsity patterns, we selected 80% <strong>of</strong> the weights to bezero, on average. There<strong>for</strong>e, the a null model (no predictive ability) has an expected areaunder ROC curve <strong>of</strong> 0.5, and an expected precision <strong>of</strong> 0.2 (the proportion <strong>of</strong> the positiveclass) <strong>for</strong> all recall levels. We computed an unbiased estimate <strong>of</strong> R 2 <strong>for</strong> the models withhyperparameters tuned in cross-validation and tested on independent data (not used in thecross-validation).For each penalised method, we explored a grid <strong>of</strong> penalties across each penalty. We useda grid <strong>of</strong> 20 penalties λ max × (1, ..., 10 −6 ) and γ = 10 −3,...,6 , where λ max is the smallestlambda that makes all weights in the model zero (see Section 5.4.2) We did not threshold thecorrelation matrix, instead we used the correlation as is, unlike Kim and Xing (2009), sincethe thresholding is <strong>of</strong>ten arbitrary and data dependent. There<strong>for</strong>e, all correlations were usedas the basis <strong>for</strong> graph edges (the set E included all possible edges, with different weights basedon the correlation).8.5. ResultsWe compared the per<strong>for</strong>mance <strong>of</strong> fmpr with other methods in simulation and the DILGOMdata.8.5.1. SimulationSetup 1: Increasing sample size Figure 8.4 shows that as sample size N increased from 50to 200, so did the recovery per<strong>for</strong>mance <strong>of</strong> all methods, with fmpr-w1 and fmpr-w2 showingnotably higher ROC and PRC curves than the other methods. fmpr-w2 showed a smalladvantage over fmpr-w1 in terms <strong>of</strong> recovery. Both fmpr methods increased the recoverywith increasing sample size, much faster than lasso, elastic net, and ridge regression thatimproved only slightly. R 2 increased as well, from approximately 0 <strong>for</strong> all methods to morethan 0.1 <strong>for</strong> fmpr-w1 and fmpr-w2.Setup 2: Increasing noise levels Figure 8.5 shows that as noise levels σ increased, so didthe recovery per<strong>for</strong>mance <strong>of</strong> all methods decrease, until it was not much better than random<strong>for</strong> methods with σ = 2. At the lowest and middle noise levels, fmpr-w1 and fmpr-w2 showedbetter recovery than all other methods, with higher R 2 as well.Setup 3: Increasing number <strong>of</strong> tasks Figure 8.6 shows that with increasing number <strong>of</strong> tasksK, there was no substantial change in the recovery per<strong>for</strong>mance <strong>of</strong> lasso, elastic net, and ridgeregression, which is consistent with our expectation, as these methods ignore the inter-taskdependencies and are there<strong>for</strong>e equivalent to K separate models. The small differences are182


likely due to random variation in the simulation data, and the fact that with more tasks the ROC/PRC curves are estimated over more data and are hence more precise. In contrast, both GFlasso-w1/w2 and fmpr-w1/w2 showed increases in recovery and in R², with fmpr-w2 having the best performance overall.

Setup 4: Increasing weights. Figure 8.7 shows that with increasing weights β, all methods showed substantial improvements in both true weight recovery and R², from close to random performance with β = 0.05 to high, and in some cases close to perfect, performance for β = 0.5. In the reference setting, fmpr had better recovery than all other methods; however, in the high-weight setting both fmpr and GFlasso performed equally, with ridge regression showing the lowest performance.

Setup 5: Increasing number of parameters. Figure 8.8 shows that, unlike with an increasing number of tasks or samples, when the number of parameters p increased there was no monotonic increase in performance; rather, all methods improved when p increased from 50 to 100, but their performance was reduced as p went to 200. This phenomenon may be due to the fact that the simulations use a fixed proportion of zero weights (80% on average), together with a fixed weight of 0.1 and a fixed noise level of σ = 1. When p = 50, 100, 200, the output in each task y_k is then a weighted sum of 5, 20, and 40 weights, respectively, plus random noise. Therefore, the signal-to-noise ratio was not identical between the experiments, but increasing in p, leading to an increase in performance with increasing p, up to the point where there are too many variables in the model to estimate from the given amount of data. From that point onwards, an increase in p with N fixed (together with a concomitant increase in the number of true non-zero variables) tended to reduce the performance in recovery of the weights.

Setups 6 and 7: Differing causal architectures. The previous simulation setups explored the scenario where all tasks have the same sparsity pattern and the same weights β, showing a strong advantage of fmpr over the other approaches. In contrast, as seen in Figure 8.9, when tasks had the same sparsity pattern but the weights differed slightly (μ_β = 0.5, σ_β = 0.05), both fmpr and GFlasso performed identically well, exhibiting both better recovery and higher R² than elastic net, lasso, and ridge. When the weights varied more substantially (μ_β = 0.5, σ_β = 0.5 and σ_β = 2), fmpr did not perform better than elastic net or lasso (but better than ridge), and GFlasso-w1/w2 performed slightly better in terms of recovery and R². When the tasks were completely unrelated, through different sparsity patterns and different weights, all methods except ridge regression performed identically (Figure 8.10).

Setup 8: Mixed positive/negative correlations. Since the sign of the correlation is accounted for in the fmpr penalty, we expect to see similar results for positive and negative correlations, as long as their magnitude is the same.
To verify this, we performed experiments identical to


the reference setup, except that the signs of the weights were randomly flipped in some tasks, creating a mix of positive and negative correlations (50% negative on average out of all task pairs). The results were qualitatively similar to the reference setup (Figure 8.11), confirming that fmpr takes advantage of both negative and positive correlations between the tasks.

To visualise the effect of each method, Figure 8.12 shows the sparsity patterns recovered by each method for the reference setup, in one simulation, compared with the true simulation weights β. fmpr-w1 and fmpr-w2 produced distinct banding patterns, similar to that of the true weights. GFlasso-w2 had similar banding patterns, and to a lesser extent so did GFlasso-w1. In contrast, lasso, ridge, and elastic net did not produce any noticeable banding patterns at all, with lasso cross-validation selecting a close-to-empty model (most weights were zero). These qualitative results complement the quantitative results for weight recovery shown in Figure 8.4b, where fmpr-w1/w2 showed substantially better ROC and PRC curves than the other methods.

8.5.2. Experiments on the DILGOM Dataset

As a proof of concept of the applicability of the multi-task approach to real data, we used the Dietary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome (DILGOM) dataset (Inouye et al., 2010b), containing 509 samples in total (234 males, 275 females) assayed for 136 metabolites and for gene expression of 35,419 genes, with the aim of predicting metabolite levels from the gene expression levels. The preprocessing of the data has been described in Chapter 7. The DILGOM data represent a more realistic setting than the simulations, as there are strong correlations both in the inputs and in the outputs, and any pleiotropic gene is likely to have different association strengths with different metabolites, some of them potentially with opposing signs (upregulating one metabolite, downregulating another).

To reduce the computational load, we first used univariable linear regression of each metabolite on each gene, and then selected genes that had at least one metabolite with linear regression p ≤ 5 × 10⁻⁴ (from a t-test with N − 1 degrees of freedom). This resulted in p = 2429 genes. We also included the 21 clinical variables in order to reduce confounding of the gene expression levels due to clinical factors.

We selected cluster 1 of the metabolites (defined in Section 7.3.1), a cluster consisting of 35 metabolites, mainly VLDL subtypes. This cluster exhibited strong correlations (Figure 8.13): all correlations were positive and many were strong (0.8 and higher). We then used repeated 5-fold nested cross-validation to estimate the R²: we used 4/5ths of the data for optimising the penalties (through cross-validation), trained the models on the training data using the optimal penalties found, and tested the models on the independent 1/5th of the data. This process was repeated ten times.
Due to the high computational cost, we did not evaluate GFlasso (see Section 8.5.3 for timing experiments), elastic net, ridge regression, or the w1 variant of fmpr.
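The gene pre-filter described above can be sketched as follows; expr (the N × 35,419 expression matrix) and metab (the N × 136 metabolite matrix) are hypothetical object names, and a vectorised implementation would be used in practice for this many regressions.

    ## A minimal sketch of the univariable pre-filter: keep genes with at least
    ## one metabolite association at p <= 5e-4.
    filter_genes <- function(expr, metab, threshold = 5e-4) {
      keep <- apply(expr, 2, function(g) {
        pvals <- apply(metab, 2, function(m) {
          summary(lm(m ~ g))$coefficients["g", "Pr(>|t|)"]  # slope t-test p-value
        })
        any(pvals <= threshold)
      })
      expr[, keep, drop = FALSE]
    }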


Figure 8.4.: Simulations with varying numbers of samples N (Setup 1), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) N = 50; (b) N = 100 (reference); (c) N = 200.


Figure 8.5.: Simulations with varying levels of noise σ (Setup 2), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) σ = 0.5; (b) σ = 1.0 (reference); (c) σ = 2.0.


Figure 8.7.: Simulations with varying weights β (Setup 4), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) β = 0.05; (b) β = 0.1 (reference); (c) β = 0.5.


Figure 8.8.: Simulations with varying number of parameters p (Setup 5), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) p = 50; (b) p = 100 (reference); (c) p = 200.


Figure 8.9.: Simulations with same sparsity but different weights β (Setup 6), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) μ = 0.5, σ_β = 0.05; (b) μ = 0.5, σ_β = 0.5; (c) μ = 0.5, σ_β = 2.


Figure 8.10.: Simulations with unrelated tasks (Setup 7), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.


Figure 8.11.: Simulations with a mixture of positively and negatively correlated tasks (roughly 50%/50% each, Setup 8), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.


Figure 8.12.: The true non-zero simulation weights β, and the weights estimated by each method in one replication of the reference setup. The intensity of the lines represents the absolute value of the estimated weight β̂. The vertical and horizontal axes correspond to variables j = 1, ..., p and tasks k = 1, ..., K, respectively. Note that the weights of the lasso model were all zero.


Figure 8.13.: Pearson correlations for 35 metabolites from cluster 1 of the DILGOM metabolites.

Recovery of the non-zero weights cannot be evaluated in these data since we do not know which weights are truly non-zero. However, we can qualitatively evaluate the sparsity pattern induced by each model. Examples of the recovered sparsity patterns for the DILGOM dataset are shown in Figure 8.14 (for clarity, we show a subset consisting of 200 genes). fmpr-w2 tended to produce sparsity patterns that were more consistent across the tasks. In comparison, lasso produced patterns that varied more between tasks.

Figure 8.15 shows the nested cross-validated R² for the different methods applied to the 35 DILGOM metabolites using the genes as inputs, for each metabolite separately and over all metabolites. fmpr-w2 showed consistently higher R² than the lasso, evaluated both for each individual metabolite and for all metabolites together. Overall, 33 metabolites had a higher median R² for fmpr-w2 than for lasso, with the exceptions being CH2inFA (methylene groups in the fatty acid chain) and HDL3_C (total cholesterol in high-density lipoprotein 3). Out of the 35 metabolites, 4 had Wilcoxon rank-sum test p-values lower than the Bonferroni-corrected threshold of 0.05/35 = 0.00143.
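The per-metabolite comparison can be sketched as below, where r2.fmpr and r2.lasso are hypothetical matrices holding the 50 cross-validated R² estimates (rows) for each of the 35 metabolites (columns).

    ## A minimal sketch of the Bonferroni-corrected Wilcoxon comparison of R^2.
    compare_r2 <- function(r2.fmpr, r2.lasso, alpha = 0.05) {
      K <- ncol(r2.fmpr)
      pvals <- sapply(seq_len(K), function(j) {
        wilcox.test(r2.fmpr[, j], r2.lasso[, j])$p.value  # rank-sum test per metabolite
      })
      pvals <= alpha / K                                  # Bonferroni threshold (0.05/35)
    }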


Figure 8.14.: Recovered weight matrices B̂ for 200 genes over the 35 metabolites, for lasso and fmpr-w2, based on penalties optimised by cross-validation. The vertical and horizontal axes represent genes and tasks, respectively. The intensity of each point represents the absolute value of the weight β.


Figure 8.15.: Box-and-whisker plots of R² for fmpr-w2 and lasso over 35 metabolites from cluster 1 of the DILGOM metabolites, using gene expression as inputs. We used 10 × 5 nested cross-validation to produce 50 estimates. The stars represent statistical significance from a Bonferroni-corrected Wilcoxon rank-sum test, p ≤ 0.05/35 = 0.00143.

8.5.3. Time Complexity

Due to the simplicity of the optimisation algorithm, the fused ridge method is fast enough to fit models to large datasets. However, it will inevitably be slower than the lasso due to the need to compute the fusion penalty and the need to optimise over two separate penalties. Unlike the lasso, fmpr's time complexity per iteration over the K tasks grows as O(K²), due to the need to consider the differences between all K × (K − 1)/2 pairs of tasks for each variable j = 1, ..., p.

Figure 8.16 shows the scaling behaviour of fmpr and spg (implementing GFlasso) with respect to different sample sizes N, numbers of parameters p, and numbers of tasks K (see Figure C.1 for boxplots showing the variability over the replications). We used the same penalties for both methods. All timings were performed on an Intel Core 2 Duo 3.06GHz machine running


Linux. Overall, fmpr substantially outperformed spg in all comparisons, especially for larger numbers of variables p and tasks K. We also evaluated the scaled time, where the timings of each method were scaled to approximately the same range in order to better visualise their scaling behaviour. The scaled results show linear increases in time for increases in N for both methods. In contrast, increasing p results in higher-than-linear increases for spg but only linear increases for fmpr, as fmpr utilises coordinate descent and no matrix operations are involved. The results for increasing K show faster-than-linear growth, consistent with the fact that the number of task-to-task edges increases quadratically with the number of tasks K.

8.6. Discussion

We have proposed Fused Multitask Penalised Regression (fmpr), a statistical model for leveraging task correlations in the multi-task setting, in order to borrow information between correlated tasks. In contrast with the fused lasso, which employs an l1 fusion penalty, fmpr employs an l2 fusion penalty, making it amenable to fast optimisation using coordinate descent. We have assessed the ability of our method and others to recover the true causal variables in simulations and to predict the phenotype (R²). The l2 fusion penalty was found to produce better models when the weights of causal variables in related tasks were of similar magnitude, a result which was robust to varying settings of noise, varying numbers of samples, varying numbers of tasks, varying magnitudes of the causal variables, and varying numbers of inputs. When the simulation weights were vastly different, but still had the same sparsity pattern, fmpr performed well, but the l1 fusion penalty of GFlasso resulted in slightly better models and better recovery of the non-zero weights. When tasks were completely unrelated, all methods performed identically, with the exception of ridge regression, which was substantially worse. In an analysis of the DILGOM dataset, using gene expression to predict concentrations of 35 correlated metabolites, fmpr-w2 showed better predictive performance than the lasso, in terms of median R², for most of the metabolites (33 out of 35), by leveraging the inter-metabolite correlations.

In summary, this work demonstrates the value of multi-task methods in the genomic setting, where multiple phenotypes, such as metabolite concentrations or gene expression levels, are typically highly correlated. Taking these correlations into account resulted in multi-task models that were consistently better than the single-task models and rarely worse, suggesting that these methods are safe to use even in realistic settings where the true causal variables are never known.
In addition, the higher computational efficiency of the fmpr approach means it can be applied more readily to larger datasets than the spg method, making fmpr potentially more useful in the analysis of real datasets.


Figure 8.16.: Average time to run fmpr and spg over 50 independent replications. (a) Increasing samples N (p = 400, K = 10). (b) Increasing parameters p (N = 100, K = 10). (c) Increasing tasks K (N = 100, p = 100). The left panel in each subplot shows the wall time; the right panel shows time scaled to the same approximate range in order to better show the trends.


9. Conclusions

A central theme of this thesis has been prediction. It is an investigation of supervised learning techniques for modelling associations between molecular data, such as gene expression values or SNPs, and variables representing clinical or molecular phenotypes. In this paradigm, models are evaluated mainly based on prediction performance in cross-validation or on independent data, rather than on finding associations that are statistically significant but do not necessarily contribute much predictive ability, as is commonly the case with many genome-wide association studies (GWAS). This thesis has shown that the predictive approach to analysis of gene expression data and large genetic data is computationally feasible, produces interpretable models of the underlying biology, and may be precise enough to enable population-wide screening for celiac disease and type 1 diabetes from SNP data. Sparse linear models of SNP data often produced models with better predictive ability than other approaches, such as non-sparse SVMs (Chapter 5) and models built on SNPs selected using univariable statistics (Chapter 6). In addition, the penalised models were more robust across different datasets with potentially different genetic architectures (Gibson, 2011). These large differences between the methods show that estimates of the amount of “missing heritability” (Manolio et al., 2009) crucially depend on the statistical model employed, and that the phenotypic and genetic variance may be better explained by sparse penalised methods. In practical terms, these results strongly suggest that penalised methods are preferable for predictive modelling of human genome-wide case/control data.

The penalised models employed in this thesis make the implicit assumption that a “small” number of features are relevant to the outcome, and the rest are spurious or unimportant.


This assumption is necessary for several reasons. First, it is biologically plausible to assume that out of the hundreds of thousands of SNPs tagging variation across the human genome, most of them are not substantially related to a given disease. Second, for statistical reasons, it is not possible to do meaningful modelling and feature selection under the assumption that many or all variables are truly associated with the phenotype, when the sample size is far outweighed by the number of variables (N ≪ p). Even lasso-like approaches, which have been shown to correctly identify the true associations in high dimensions, crucially depend on the number of these true associations being bounded relative to the number of variables. In contrast, non-sparse methods are known not to be able to recover the true causal variables in high dimensions. This is termed the “bet on sparsity” by Hastie et al. (2009a): “Use a procedure that does well in sparse problems, since no procedure does well in dense problems”. Third, the sparsity is tuned through cross-validation and guided by the predictive performance. Therefore, we do not impose excess sparsity on the model if predictive performance indicates that this is not the right thing to do; in practice this process often does result in sparse models, either because the sparse models truly are better than dense models, due to the (unknown) sparsity of the data, or because the dense models are inferior, as their coefficients cannot be estimated well due to the high dimensionality (over-fitting).

However, having good prediction of the phenotype is not sufficient: biological interpretability is crucial as well, since once good prediction is achieved, one of the most natural questions to explore next is which of the variables are involved in the model and what their relative importance is. For example, in analysis of SNP data we would like to know which SNPs are associated with the phenotype, and how much variation is explained by each one. For this reason, lasso linear models are especially attractive, as the model weights are directly interpretable in terms of their contribution to the overall model, and the models can be made sparse. Moreover, lasso models enjoy what is sometimes called sparsistency (Kim and Xing, 2009; Zhao and Yu, 2006), meaning that under certain statistical assumptions, the truly associated variables can be recovered from the data with probability one. In practice, these assumptions cannot usually be verified in real data. However, the advantage of lasso models over the univariable approach in identifying the correct variables can be quantified empirically in simulation, where the true causal variables are known. In the real-world setting of genome-wide studies, selecting a high significance threshold generates many false positives which may lead to misleading biological interpretation. On the other hand, a stringent cutoff will discard many true associations (false negatives), leading to a small number of significant SNPs that are hard to interpret biologically.
Penalised approaches are not completely immune to these problems, but have far lower false positive rates in recovering true associations, making us more confident in the SNPs found. Ultimately, all data analysis is limited by the dataset at hand. Most currently available SNP datasets have several thousand samples, which may be enough to confidently detect most of the strong effects, but too small to enable detection of


weak effects which may exist. As datasets grow in size, there is a better chance of confidently capturing the weaker effects, thereby refining our models to include many more SNPs and potentially giving a more complete picture of the underlying biology.

Interpretability is also reduced when different studies produce different lists of prognostic genes, as has been the case with analysis of breast cancer metastasis datasets. In Chapter 4 we used a “feature engineering” approach in which expression of individual genes is aggregated to expression of gene sets, and predictive models are in turn applied to these data. The gene set approach achieved far greater stability and consistency between different studies, thus increasing the interpretability of the genes in terms of coherent cellular processes rather than disparate lists of loosely related genes.

Recovery of causal variables can be quantified using different measures, based on factors such as the class balance: the proportion of truly causal variables to the non-causal variables. As discussed in Chapter 6, when the truly causal variables are a small proportion of the total, measures such as ROC curves and areas under ROC curves (AUC) can be misleading, as high sensitivity may implicitly induce a high number of false positives as well. In such cases, we advocate the use of measures such as precision-recall curves, as they highlight potentially high false positive rates that may otherwise go unnoticed. More generally, these results also highlight the importance of choosing a suitable measure of success for each problem, as different measures emphasise different aspects of the model. Therefore, we suggest evaluating model performance using a variety of complementary measures, when possible. While the simulations were generated based on real haplotype data from HapMap, there is room to explore how well the simulations represent real case/control data, and whether these results hold in other, more extreme settings, such as when the case/control labels are more imbalanced.

Two important computational themes of this thesis have been efficiency and scalability: sophisticated models are of little practical use unless they can successfully be applied to real data. As models have become more sophisticated, so has data size grown: recent SNP microarrays approach 2 million SNPs, and sample sizes are increasing as well, with several datasets consisting of around one hundred thousand samples now available. To enjoy the benefits of models such as the lasso, efficient and scalable algorithms must be developed for fitting these models to such data.
The algorithms we have discussed in Chapters 5 and 8 are based on coordinate descent, achieving computation time that scales linearly in either the number of samples or the number of input variables, without requiring all data to be in memory at once. This makes them particularly attractive in the genome-wide setting, enabling us to rapidly analyse large datasets, as we have done for SNP data in Chapter 6 and for multi-omic data in Chapter 7.
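As a concrete illustration of the per-coordinate computation these algorithms rely on, the standard cyclic coordinate descent update for the squared-error lasso (Friedman et al., 2010) is given below; SparSNP and fmpr use related updates for their respective losses, so this is a sketch of the general structure rather than the exact update used in those implementations:

\beta_j \leftarrow \frac{S\!\left(\frac{1}{N}\sum_{i=1}^{N} x_{ij}\, r_i^{(j)},\ \lambda\right)}{\frac{1}{N}\sum_{i=1}^{N} x_{ij}^{2}},
\qquad
S(z, \lambda) = \operatorname{sign}(z)\,\max(|z| - \lambda,\ 0),

where r_i^{(j)} = y_i − Σ_{k≠j} x_{ik} β_k is the partial residual with the jth variable excluded, and S is the soft-thresholding operator. Each update touches only the jth column of the data, which is what allows the genotypes to be read from disk one SNP at a time rather than held in memory.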


Finally, analysis of large datasets itself produces large amounts of output that in turn needs to be post-processed: for example, cross-validation produces multiple models, each with its own version of the model weights. Our analysis of the DILGOM dataset generated close to 1.5 TiB of results. There is a need for better computational approaches to store and post-process these results in order to effectively and efficiently extract information from the raw data. These challenges will only increase as datasets grow larger in the number of features, and as we integrate more data types, such as in analysis of larger multi-omic datasets and of whole-genome sequencing data.

Future Work

There are several ways in which the work in this thesis can be extended. We begin by considering several extensions to the analysis of the breast cancer gene expression data. Next, we consider extensions to the sparse linear models, including the lasso and fmpr, and the algorithms for fitting these models to data.

We analysed the breast cancer datasets using non-sparse methods, such as centroid classifiers and support vector machines. There may be benefit from analysis using sparse models. However, the high degree of correlation between genes, particularly between the causal and non-causal genes, may hamper the ability of lasso models to correctly identify the causal variables (Zhao and Yu, 2006). An important challenge in this area is merging of datasets from separate studies, as individual gene expression datasets tend to be limited to several hundred samples, limiting the statistical power achieved in any one study. There may also be room to examine set statistics that take into account pathway structure, in the form of directed graphs. For example, if we consider a gene module as an information processing module (Alon, 2007) with defined inputs and outputs, we might be able to concentrate our efforts on several genes deemed to be output nodes (downstream of all other nodes in the network) and ignore internal nodes that may only mediate between genes in the module.

In the analysis of SNP data using SparSNP, we have only considered models that are additive in the allele dosage. While additivity is a common simplifying assumption in association studies, and has been justified on grounds of both genetic theory and experiments (Hill et al., 2008), other genetic configurations are known to occur, such as dominant/recessive alleles, and there may be benefit from exploring these models in real data. If there is prior knowledge about the mode of action of a SNP, then it is reasonable to represent it in the model; however, for many SNPs the mode of action is not known a priori. Exploring all possible combinations of such multivariable models is computationally infeasible (there are k^p possible combinations if k models are considered for each of the p SNPs). A simpler procedure may be to test several models for each SNP independently through a series of univariable tests, select one mode of action for each SNP, and then build a multivariable model combining the per-SNP models into one.
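As an illustration of the per-SNP encodings such a univariable comparison would consider, a minimal R sketch follows; g is a hypothetical vector of minor-allele dosages (0/1/2) for one SNP, y is a 0/1 phenotype, and the use of AIC to pick the mode of action is an arbitrary choice made for the sketch, not a recommendation from this thesis:

g.additive  <- g                     # 0, 1, 2 copies of the minor allele
g.dominant  <- as.integer(g >= 1)    # carries at least one copy
g.recessive <- as.integer(g == 2)    # carries two copies

# Fit one univariable logistic regression per coding and keep the best-fitting one.
fits <- lapply(list(additive = g.additive, dominant = g.dominant, recessive = g.recessive),
               function(x) glm(y ~ x, family = binomial))
best.coding <- names(which.min(sapply(fits, AIC)))

The chosen coding for each SNP would then be carried forward into the multivariable model.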
If non-linear effects are desired, they can be achieved through the use of kernel methods, such as polynomial kernels and Gaussian (radial basis function) kernels, at the cost of reduced interpretability, since the model weights are then expressed with respect to the samples rather than the features.


A preliminary investigation of a celiac disease dataset using SVMs with Gaussian kernels did not find any predictive advantage over the linear squared hinge loss employed by SparSNP (results not shown). However, improvements for other datasets cannot be ruled out, as the genetic architecture likely varies between diseases. In addition to genetic data, we may include other variables in the SparSNP model besides genotypes, such as age, sex, or any other clinical variable of relevance. Conceptually, these variables can be treated just like the genotypes in the model.

Another avenue we did not explore is epistasis, loosely defined as non-linear interactions between genes or SNPs. Considering epistasis in predictive models raises formidable computational, statistical, and interpretation challenges. Epistasis must first be detected in the data; however, there is a combinatorial explosion in the number of epistatic sets of SNPs that need to be examined, even for very simple forms of epistasis. For a dataset of 500,000 SNPs, there are more than 10^11 pairs of SNPs that need testing, and for triplets there are more than 2 × 10^16 sets. Clearly, examining even small epistatic sets is computationally demanding. Furthermore, even if all such pairs are tested for association with the phenotype, the multiple testing correction for the multitude of tests will be severe but necessary, as even a tiny false positive rate will incur a large number of false detections in absolute terms. Quality control of the data will also need to be especially stringent, as the effects of minor genotyping error or other confounding factors, such as cryptic relatedness or population stratification, can potentially be amplified when considering sets of SNPs together. Once putative sets of epistatic SNPs are detected, they may be used for predictive modelling, as individual SNPs are now. However, substantially larger datasets may be required to reduce the effects of overfitting induced by models with potentially very large numbers of variables. These obstacles notwithstanding, tackling epistasis, or at least simplified versions of it, may be a fruitful avenue for generating more realistic models and consequently for explaining yet more phenotypic variation.
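The pair and triplet counts quoted above follow directly from binomial coefficients; for p = 500,000 SNPs,

\binom{p}{2} = \frac{500{,}000 \times 499{,}999}{2} \approx 1.25 \times 10^{11},
\qquad
\binom{p}{3} = \frac{500{,}000 \times 499{,}999 \times 499{,}998}{6} \approx 2.08 \times 10^{16},

so even restricting attention to pairwise interactions multiplies the number of tests by roughly five orders of magnitude relative to the single-SNP analysis.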
Another important source of phenotypic variation not considered here is the environment, including gene × environment interactions. The case/control SNP datasets examined here were purposely designed to minimise confounding by environmental factors, by sampling unrelated individuals from the same ethnic populations. However, rather than removing such effects, there may be advantages in leveraging data from individuals in shared environments, as long as this confounding is properly accounted for, as has been done in pedigree studies before the advent of GWAS.

Alternative penalisation schemes to the standard lasso include the elastic net (Zou and Hastie, 2005), the adaptive lasso (Zou, 2006), and non-convex penalties such as SCAD (Fan and Li, 2001). The advantage of some non-convex penalties over the l1 penalty is that they tend to penalise the non-zero weights less, resulting in less biased estimates; in this sense they better approximate the l0 penalty.
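For reference, the elastic net mentioned above combines the two penalties in a single term; in the common parameterisation (with the mixing hyperparameter α referred to again later in this chapter),

P_{\lambda,\alpha}(\beta) = \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right), \qquad \alpha \in [0, 1],

which reduces to the lasso at α = 1 and to ridge regression at α = 0.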


However, non-convexity means that optimisation is difficult in general, since convergence to a global optimum is not guaranteed. Optimising these non-convex models typically involves generating multiple solutions from random initialisations and choosing the best solution amongst them. The lasso method, like other frequentist approaches, does not explicitly take into account prior knowledge that the researcher may have at their disposal, such as known SNP–gene associations and known mappings of genes to pathways, which could be used to influence the selection of SNPs in the model. Hierarchical Bayesian models (Gelman and Hill, 2007; Kim and Xing, 2011) can incorporate such priors, accounting for this information. In addition, Bayesian models also allow for inclusion of multiple populations in the data analysis, and enable handling of data missingness (imputation) as part of the model fitting process.

The fused penalty approach fmpr was presented in terms of squared loss for linear regression; however, it can easily be adapted to any linear or log-linear model, such as logistic regression or squared hinge loss classification, for use with binary phenotypes such as case/control status. There may be room to employ mixed l1/l2 fusion penalties in order to achieve greater flexibility in modelling, such that another hyperparameter may automatically determine whether to use an l1 or an l2 penalty in a data-dependent way, similar to the hyperparameter α in the elastic net (Zou and Hastie, 2005). In addition, instead of using the same value of the penalty for all tasks, it may be useful to have a per-task penalty λ_k, k = 1, ..., K, rather than one global penalty λ, potentially allowing for better sensitivity to the unique characteristics of each task. This will require more computation, due to the need to tune each penalty separately with cross-validation. There is also room for further research into measures of task relatedness other than Pearson correlation, and into different transformations of the correlation. Given enough data, it may be possible to reliably infer task relatedness from the data itself, perhaps using partial correlation or sparse partial correlation in the form of the graphical lasso (Friedman et al., 2007).

Both SparSNP and fmpr are fitted using coordinate descent, a simple yet effective approach. However, when analysing even larger amounts of data, such as large SNP arrays or sequencing data, standard coordinate descent may still be too slow. The simplest way to scale up coordinate descent is parallelisation: splitting the SNPs into several blocks and performing the updates concurrently, a procedure which has been shown to be feasible in the sense that the algorithm still converges (Bradley et al., 2011). This scheme leads to speedups linear in the degree of parallelism, up to a theoretical maximum (beyond which the algorithm may not converge). Such an approach would likely increase the size of datasets that can be analysed to many millions of SNPs.
Another avenue is to use specialised hardware, such as high-performance graphics processing units (GPUs), which have been used for detection of epistasis in SNP data (Kam-Thong et al., 2011). Beyond SNP data, large-scale parallelisation could potentially enable fitting statistical models to whole-genome sequencing data, when such data become routinely available for large cohorts.


A. Supplementary Results for Gene Set Statistics

A.1. Classifiers

In addition to the centroid classifier, we tested the shrunken centroid (Tibshirani et al., 2003) in the R package pamr (Hastie et al., 2009b), our implementation of the classifier from van 't Veer et al. (2002), and a support vector machine with a linear kernel (kernlab package, Karatzoglou et al., 2004). We optimised the shrunken centroid's threshold and the SVM's number of features and l2 penalty using nested random splits, where the data were randomly split into three parts: training, validation, and testing. The model was fit to the training data, and its AUC calculated for its predictions on the validation data. This was repeated over a grid of values appropriate for each model type. The optimal hyperparameters were then chosen as the ones maximising the AUC over the validation set. The model was then refit using the optimal hyperparameters on the training and validation data together, and tested on the remaining test data. Its AUC over the test data is reported. The whole procedure is repeated B times, producing B classifiers (for each classifier type), with different sets of optimal hyperparameters. The procedure is performed separately for each of the five datasets.

There are conflicting descriptions of the exact form of the classifier used in van 't Veer et al. (2002). In the original paper, it seems that the classifier classifies each sample using its Pearson correlation with each of the centroids of the positive and negative metastasis classes:

\hat{y}_i = \underset{j \in \{-1,+1\}}{\arg\max} \; \mathrm{Corr}(x_i, c_j),    (A.1)


where Corr(⋅) is the Pearson correlation, x_i is the ith sample of p genes, and c_j is the centroid of the jth class, j ∈ {−1, +1}. In other publications (Tibshirani and Efron, 2002; van de Vijver et al., 2002), the classifier is said to be based on the correlation of the sample with the positive class only, and a threshold on that correlation is used to determine which class is predicted:

\hat{y}_i =
\begin{cases}
+1 & \text{if } \mathrm{Corr}(x_i, c_1) \geq \tau; \\
-1 & \text{otherwise},
\end{cases}    (A.2)

where c_1 is the centroid for the positive class, and τ is a user-specified threshold.

We implemented both approaches, denoting them here VV1 and VV2 respectively. For the VV1 approach, we did not choose a threshold but used the correlation with the positive class as the prediction. The VV2 approach is identical to the centroid classifier used in our work when the samples have been normalised to have unit norm (McLachlan et al., 2004, pp. 202–203).

A.2. Internal Validation

Figures A.1, A.2, A.3, A.4, and A.5 show results for internal validation of the centroid classifier, the SVM, the PAM classifier (shrunken centroid), and the two van 't Veer et al. (2002) classifiers, respectively. For the centroid classifier, recursive feature elimination (RFE) was used. For the other classifiers, all features (all genes or all gene sets) were used.

A.3. External Validation

Figure A.6 shows AUC for external validation for the different models.

Significance of AUC Differences

We used ANOVA to test for differences in AUC between the set statistics, as produced by the centroid classifier. AUC is approximately normally distributed when it is not too close to zero or one. We present results for 1, 8, 64, and 4096 features in Table A.1. Taking multiple testing into account, the ANOVA shows that the AUCs for the set centroid, set median, and set t-statistic are not significantly different from the AUC for individual genes. The AUCs for the set PC and set U-statistic are significantly lower.
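A minimal R sketch of the positive-centroid score and the thresholded VV2 rule of Eq. A.2 (the two-centroid rule of Eq. A.1 additionally compares against the negative-class centroid); X.train and X.test are hypothetical samples × genes expression matrices, y.train is a vector of −1/+1 labels, and tau is the user-specified threshold:

# Centroid of the positive (metastasis) class, estimated from the training data.
c.pos <- colMeans(X.train[y.train == +1, , drop = FALSE])

# Correlation of each test sample with the positive centroid (used directly as the VV1-style score).
score <- apply(X.test, 1, cor, y = c.pos)

# VV2 (Eq. A.2): threshold the correlation to obtain a class label.
pred.vv2 <- ifelse(score >= tau, +1, -1)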


Figure A.1.: Internal validation (mean and 95% CI for AUC) for the centroid classifier with RFE. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC versus number of features for set.centroids, set.medians, set.medoids, and set.medoids2. Plot data omitted.]


Figure A.2.: Internal validation (mean and 95% CI for AUC) for the SVM classifier, using all features. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC by statistic: raw, set.centroids, set.medians, set.medoids, set.pcs, set.u.log. Plot data omitted.]


Figure A.3.: Internal validation (mean and 95% CI for AUC) for the PAM classifier. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC by statistic: raw, set.centroids, set.medians, set.medoids, set.pcs, set.u.log. Plot data omitted.]


Figure A.4.: Internal validation (mean and 95% CI for AUC) for the VV1 classifier. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC by statistic: raw, set.centroids, set.medians, set.medoids, set.pcs, set.u.log. Plot data omitted.]


Figure A.5.: Internal validation (mean and 95% CI for AUC) for the VV2 classifier. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC by statistic: raw, set.centroids, set.medians, set.medoids, set.pcs, set.u.log. Plot data omitted.]


Figure A.6.: External validation (mean and 95% CI for AUC) for all models. [Plot panels: (a) Centroid with RFE (point estimate), (b) Centroid with RFE (point and 95% confidence interval), (c) SVM, (d) PAM, (e) VV1, (f) VV2; AUC versus number of features for the raw genes and the gene set statistics. Plot data omitted.]


Figure A.7.: Kolmogorov-Smirnov plots for overlap between the gene sets and the modules of Desmedt et al. (2008). [Plot panels: set.centroids, set.medians, set.medoids, set.pcs, set.t.stat, set.u.stat.pval.log; score versus gene set rank for the ER−/HER2−, ER+/HER2−, and HER2+ modules. Plot data omitted.]


Figure A.8.: Heatmap of selected genes in the combined dataset (932 samples), showing the three subclasses ER−/HER2−, ER+/HER2−, and HER2+. [Heatmap omitted; rows are probes for GATA3, FOXA1, ESR1, PGR, ERBB2, MKI67, HMMR, AURKA, and TOP2A.]


(a) 1 feature, ANOVA F-test p < 2.22 × 10^−16
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            0.6545     0.0116       56.63     0.0000
set.centroids         -0.0103     0.0163       -0.63     0.5301
set.medians           -0.0083     0.0163       -0.51     0.6110
set.medoids           -0.0084     0.0163       -0.52     0.6073
set.pcs               -0.0290     0.0163       -1.77     0.0784
set.t.stat             0.0199     0.0163        1.22     0.2263
set.u.stat.pval.log   -0.1615     0.0163       -9.88     0.0000

(b) 8 features, ANOVA F-test p < 2.22 × 10^−16
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            0.6836     0.0103       66.11     0.0000
set.centroids          0.0093     0.0146        0.64     0.5260
set.medians            0.0110     0.0146        0.75     0.4548
set.medoids            0.0067     0.0146        0.46     0.6452
set.pcs               -0.0684     0.0146       -4.68     0.0000
set.t.stat            -0.0035     0.0146       -0.24     0.8134
set.u.stat.pval.log   -0.1853     0.0146      -12.67     0.0000

(c) 64 features, ANOVA F-test p < 2.22 × 10^−16
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            0.6982     0.0098       71.46     0.0000
set.centroids         -0.0095     0.0138       -0.68     0.4947
set.medians           -0.0121     0.0138       -0.88     0.3813
set.medoids           -0.0122     0.0138       -0.88     0.3781
set.pcs               -0.0637     0.0138       -4.61     0.0000
set.t.stat            -0.0180     0.0138       -1.30     0.1957
set.u.stat.pval.log   -0.1910     0.0138      -13.83     0.0000

(d) 4096 features, ANOVA F-test p < 2.22 × 10^−16
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            0.6989     0.0100       70.22     0.0000
set.centroids         -0.0187     0.0141       -1.33     0.1873
set.medians           -0.0174     0.0141       -1.24     0.2187
set.medoids           -0.0212     0.0141       -1.50     0.1353
set.pcs               -0.0775     0.0141       -5.51     0.0000
set.t.stat            -0.0366     0.0141       -2.60     0.0103
set.u.stat.pval.log   -0.1799     0.0141      -12.78     0.0000

Table A.1.: ANOVA of external-validation AUC for different numbers of features. The AUC for individual genes is used as the intercept.


B. Supplementary Results for Sparse Linear Models

B.1. Scoring Measures for Causal SNP Detection

            Ŷ = 1   Ŷ = 0
True        TP      FN
False       FP      TN

Table B.1.: The confusion matrix of predicted versus actual classes. "True" denotes truly causal SNPs, "False" denotes non-causal SNPs; Ŷ = 1 and Ŷ = 0 are predictions of causal and non-causal SNPs, respectively.

Binary classification is usually evaluated through the confusion matrix, the cross-tabulation of predicted class Ŷ versus actual class, as shown in Table B.1. Common statistics derived from the confusion matrix include:

• Sensitivity, recall, true positive rate (TPR) = TP / (TP + FN)
• False positive rate (FPR) = FP / (FP + TN)
• Precision = TP / (TP + FP)

These statistics are evaluated over different cutoffs of the classifier's output.


Receiver operating characteristic (ROC) curves are induced by plotting the (TPR, FPR) pairs over all cutoff values. The ROC curve can be summarised using the area under the receiver operating characteristic curve (AUC) (Hanley and McNeil, 1982), computed through numerical integration or alternatively estimated as

\widehat{\mathrm{AUC}} = \frac{1}{N^{+} N^{-}} \sum_{i=1}^{N^{+}} \sum_{j=1}^{N^{-}} \left\{ I(\hat{y}_i > \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right\},    (B.1)

where N^+ and N^− are the numbers of cases and controls respectively, ŷ_i is the prediction for the ith case, ŷ_j is the prediction for the jth control, and I(⋅) is the indicator function, evaluating to 1 if its argument is true and to 0 otherwise. Eq. B.1 shows that the sample AUC is the maximum likelihood estimate of the probability of correctly ranking a randomly-selected causal SNP more highly than a randomly-selected non-causal SNP (with correction for ties). The expected AUC for a classifier producing random predictions is 0.5; perfect predictions have AUC = 1.0, and perfectly-wrong predictions have AUC = 0.0.

Another useful statistic is the area under the precision-recall curve (APRC, also known as average precision), which can be integrated numerically but is usually approximated as

\widehat{\mathrm{APRC}} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{Prec}_m,    (B.2)

where Prec_m is the precision at the mth level of recall, out of M levels. The expected APRC for a classifier producing random predictions is the proportion of positive samples. For estimating APRC, we used the program perf (http://kodiak.cs.cornell.edu/kddcup/software.html).

Unlike the APRC, the AUC does not depend on the relative proportions of the classes (the class balance). However, the AUC, as commonly used, can be misleading for comparing classifiers when the proportion of causal SNPs is very small, as is the case in GWA. Consider our HAPGEN simulations, where only 148 of the 73,832 SNPs are true causal SNPs. To see why AUC is not informative in these settings, consider a thought experiment similar to that used by Sonnenburg et al. (2006): we have a classifier that at some cutoff correctly classifies 100% of the true causal SNPs (TPR = 1), but also wrongly classifies 1% of the non-causal SNPs (FPR = 0.01). The AUC is the area under the curve induced by the TPR and the FPR, which is monotonically increasing; therefore, the AUC in this case must be ≥ 0.99, which seems like very good discrimination. However, when there are 73,684 non-causal SNPs, even the low false positive rate of 1% implies 0.01 × 73,684 ≈ 737 false positives on average. In comparison, even assuming a fixed recall (TPR) of 1, so that the APRC equals the precision, the number of false positives needs to be as low as 148 (the number of causal SNPs) for the precision and APRC to reach 0.5; conversely, with a false positive rate of just 0.5%, leading to ∼368 false positives on average, both the precision and the APRC drop to 148/(148 + 368) ≈ 0.287.
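A small R sketch of these two estimators, assuming a hypothetical numeric score vector s and a 0/1 label vector labels (1 = truly causal SNP); it mirrors Eq. B.1 and the mean-precision approximation of Eq. B.2, rather than the perf program used for the results in this appendix:

auc.hat <- function(s, labels) {
  pos <- s[labels == 1]
  neg <- s[labels == 0]
  # Pairwise comparisons of causal versus non-causal scores, ties counted as 1/2 (Eq. B.1).
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

avg.prec <- function(s, labels) {
  ord  <- order(s, decreasing = TRUE)
  hits <- cumsum(labels[ord])          # true positives among the top-k ranked SNPs
  prec <- hits / seq_along(hits)       # precision at each rank
  mean(prec[labels[ord] == 1])         # averaged over the recall levels (Eq. B.2)
}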


In many real-world settings, such extreme results are highly unlikely: usually, the recall/TPR is lower than 1 and the FPR is higher too, in which case the APRC is even lower. Clearly, the APRC is more sensitive to the number of false positives than the AUC, and this is an important consideration when screening for SNPs in GWA, since we would like to keep the number of false positives low: biological validation of candidates is costly and time-consuming.

B.2. Real Data

B.2.1. Checking for Stratification

We used the genomic inflation factor λ and principal component analysis (PCA) to assess population structure and cryptic relatedness in the celiac disease and WTCCC datasets.

Genomic Inflation Factors

Genomic inflation factors λ for the discovery datasets are shown in Table B.2, for both the 1-df allelic χ² test and the logistic regression genotypic test.

Dataset      λ 1-df   λ logistic
BD           1.186    1.0
CAD          1.140    1.0
Crohn        1.181    1.0
Celiac1      1.051    1.051
Celiac2-UK   1.056    1.056
HT           1.131    1.0
RA           1.111    1.0
T1D          1.120    1.0
T2D          1.150    1.0

Table B.2.: Genomic inflation factors λ estimated by PLINK v1.07 using the median of the test statistics, for either the 1-df χ² test (--assoc --adjust) or the logistic regression test (without covariates, --logistic --adjust).

Principal Component Analysis

We used smartpca from EIGENSOFT v4.0beta (Price et al., 2006) to estimate the principal components of the genotypes for each dataset. It is known that regions of high LD, such as the major histocompatibility complex (MHC) region on chr6, can produce clusters in the principal components, which can be misinterpreted as ancestral stratification (Patterson et al., 2006). We expect that population effects would show reasonably uniformly across the SNPs, whereas effects due to LD would be localised to certain regions.


Therefore, we used an LD-thinning approach similar to that of Fellay et al. (2009), namely:

1. Removed high-LD regions from the data, including chr5: 44Mb–51.5Mb, chr6: 25Mb–33.5Mb, chr8: 8Mb–12Mb, and chr11: 45Mb–57Mb.
2. Thinned the remaining SNPs using PLINK --indep-pairwise with a window size of 1500 SNPs, a step size of 150, and r² ≤ 0.2.
3. In smartpca, regressed each SNP on the previous 5 SNPs (nsnpldregress option), and removed outliers.
4. Inspected the PCA loadings of each SNP for each PC, identifying whether some regions contribute more to each PC, for both the original data and the LD-pruned data.

Any stratification remaining after this filtering is more likely to be due to true population differences than to LD effects, and if the remaining stratification is strong, it suggests that the original dataset is stratified due to population structure.

The top 5 PCs are shown in Figure B.3, for the original Celiac1 dataset (all autosomal SNPs) and for the LD-thinned version. (We examined the top 10 PCs; however, PCs beyond 5 did not show strong association with the phenotype.) There was strong stratification in the original dataset, and the top PCs were strongly predictive of case/control status (AUC ∼ 0.8); after LD-pruning, the AUC dropped to ∼0.54 (Figure B.5), indicating that the bulk of the predictive ability had been removed.

Boxplots of the PC loadings for each chromosome (Figure B.4) show that in the original Celiac1 data, chromosome 6 and chromosome 8 have vastly more SNPs with large loadings in the 1st and 2nd PCs, respectively. Other chromosomes with large contributions are chromosome 5 (PC4 and PC5) and chromosome 11 (PC4, PC6, PC9, and PC10). In comparison, the contributions in the LD-thinned data are much more uniform across the chromosomes, as expected when there is no population stratification.

B.2.2. AUC for Stringent Filtering

We prepared stringently-filtered versions of the Celiac1, Celiac2-UK, and WTCCC-T1D datasets, following the procedure in Lee et al. (2011). The datasets were filtered in PLINK by the following criteria. SNPs were removed if they had:

• MAF < 0.01 (--maf)
• missingness > 0.05 (--geno)
• deviation from Hardy-Weinberg equilibrium in controls (--hardy) with p < 0.05
• differential missingness between cases and controls (--test-missing) with p < 0.05
• two-locus test (Lee et al., 2010) with p < 0.05


Samples were removed if they had:

• missingness > 0.01 (--mind)
• relatedness π̂ > 0.05 (--genome); both samples in each such pair were removed

Post-filtering, the Celiac1 and Celiac2-UK datasets contained 2109 and 6613 samples, and 279,312 and 471,191 SNPs, respectively. For WTCCC-T1D, the stringent dataset had 4901 samples (no samples removed) and 370,280 SNPs.

AUC for cross-validation in the stringently-filtered Celiac1, Celiac2-UK, and WTCCC-T1D datasets is shown in Figure B.6. The assumed population prevalence for each disease is shown in Table B.3.

Disease       Prevalence K (%)   Source
BD            1                  (Bebbington and Ramana, 1995; Wray et al., 2010)
CAD           5.6                (Wray et al., 2010)
Celiac        1                  (van Heel and West, 2006)
Crohn's/IBD   0.1                (Carter et al., 2004; Wray et al., 2010)
HT            13.1               (NHS, 2010)
RA            0.75               (Wray et al., 2010)
T1D           0.54               (Wray et al., 2010)
T2D           3                  (Wray et al., 2010)

Table B.3.: Population prevalence for each disease as used in this work.

B.2.3. PPV/NPV

Apart from AUC, we examined several measures of classification performance: sensitivity/specificity (ROC curve), precision/recall, and PPV/NPV. Examining the individual plots can reveal anomalies in the data that are obscured when only considering aggregate statistics such as the AUC.

We examined the results for one cross-validation fold for the WTCCC-T1D dataset (Fig. B.9), the Celiac1 dataset both original and after stringent filtering (Fig. B.10 and Fig. B.11), and the Celiac2-UK dataset both original and after stringent filtering (Fig. B.12 and Fig. B.13). Most datasets had a small number of samples that were ranked highly as disease in terms of the classifier's confidence; for the squared hinge loss classifier this means high positive values of the linear predictor l. In turn, this is reflected in the plots as Specificity = 1, Precision = 1, and PPV = 1 on the left-hand side of the curves.


B.2.4. Comparison with Other Methods

We compared the lasso squared hinge loss model with two other approaches: multivariable logistic regression and GCTA (Yang et al., 2011).

The logistic regression method comprised: (1) screening of the SNPs using univariable logistic regression and selection of SNPs based on p-value; (2) SNP thinning based on LD, to reduce multicollinearity; and (3) fitting an unpenalised multivariable logistic regression to the selected SNPs. The entire logistic regression procedure was repeated within the cross-validation loop (p-values were computed only on the training data for each fold), using the same training and testing data splits as employed by the lasso models.

Compared with the 3-step logistic regression, the lasso models achieved equivalent or better AUC across all datasets, using the same number of SNPs in both models. Specifically, lasso had higher AUC in BD, Celiac1/Celiac2, Crohn, RA, and T2D, and close to equal performance in CAD, HT, and WTCCC-T1D. To evaluate model fitting, we compared the performance of the lasso to the 3-step logistic regression on subsamples of the WTCCC-T1D dataset (Figure B.15). The results indicate that the lasso-based method was more stable across a range of model sizes, whereas the 3-step logistic regression was more sensitive to the choice of model size. Two possible explanations are that the univariable screening discards potentially predictive SNPs, in contrast to the lasso, which considers all SNPs concurrently, and that the heavy penalisation inherent in the lasso approach makes it more resistant to overfitting, especially for smaller sample sizes.
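A schematic R sketch of the screening-and-refitting comparator (steps 1 and 3; the LD-thinning step 2 is omitted here for brevity); X, y, and the model size k are hypothetical placeholders, and this illustrates the procedure rather than reproducing the exact code used for these results:

# Step 1: univariable screening, one logistic regression Wald p-value per SNP.
pvals <- apply(X, 2, function(g) summary(glm(y ~ g, family = binomial))$coefficients[2, 4])

# Step 3: unpenalised multivariable logistic regression on the k most significant SNPs.
keep <- order(pvals)[1:k]
fit  <- glm(y ~ ., data = data.frame(y = y, X[, keep, drop = FALSE]), family = binomial)

Within cross-validation, both steps are applied to the training fold only, and the fitted model is then evaluated on the held-out fold.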
In addition to logistic regression, we also considered a second method for multivariable modelling, namely GCTA (Yang et al., 2011), which was used to fit a multivariable mixed-effect linear model to all autosomal SNPs in each dataset. GCTA produces estimates of the explained phenotypic variance in the data; however, estimates of variance validated on independent test data tend to be smaller than those evaluated within the same data used to derive the estimates (Makowsky et al., 2011). Therefore, we used cross-validation to produce an estimate of explained variance. In each cross-validation fold, we fitted the GCTA model to the training data, then derived GCTA's SNP score (BLUP). The SNP scores were used to estimate a per-sample risk score for independent test data, and the risk scores were used to rank the samples in the AUC calculation. We then used the AUC from the test data, together with the estimated population prevalence, to estimate the proportion of explained phenotypic variance in the test data (Table B.4). Lasso models achieved higher AUC than GCTA for three datasets: GCTA produced AUC of approximately 0.81, 0.72, and 0.65 for Celiac1/Celiac2-UK, T1D, and RA respectively, whereas the corresponding cross-validation AUC for lasso was 0.88, 0.88, and 0.74. For Crohn's, lasso produced an AUC of 0.70, while GCTA was slightly lower at AUC = 0.65. For the remainder of the datasets (BD, CAD, HT, and T2D), GCTA produced AUC similar to the lasso models. Correspondingly, GCTA models explained 19%–20% of phenotypic variance for Celiac1/Celiac2-UK, while in the other diseases no more than 10% was explained, including WTCCC-T1D.


In comparison, in cross-validation the lasso explained up to 32% of the phenotypic variance for Celiac1/Celiac2-UK (21%–38% in independent replication), and up to 28% of the phenotypic variance for WTCCC-T1D (22% in GoKinD-T1D replication).

              K       AUC                            VarExp                          N
                      Mean    95% LCL   95% UCL      Mean    95% LCL   95% UCL
BD            0.010   0.672   0.667     0.678        0.053   0.050     0.057         9
CAD           0.056   0.619   0.610     0.629        0.038   0.032     0.045         9
Celiac1       0.010   0.817   0.813     0.820        0.202   0.197     0.208         60
Celiac2-UK    0.010   0.808   0.805     0.811        0.190   0.186     0.195         30
Crohn's       0.001   0.649   0.639     0.660        0.026   0.022     0.029         9
HT            0.131   0.620   0.611     0.628        0.047   0.041     0.054         9
RA            0.008   0.648   0.639     0.657        0.036   0.032     0.041         9
WTCCC-T1D     0.005   0.720   0.716     0.723        0.078   0.075     0.081         30
T2D           0.030   0.632   0.623     0.641        0.040   0.035     0.046         9

Table B.4.: AUC and proportion of phenotypic variance explained for GCTA (Yang et al., 2011), using 3-fold cross-validation (CV). AUC was derived from the per-sample scores in the test folds for each cross-validation fold. The 95% confidence interval is from a one-sample t-test, and explained variance (including the confidence intervals) is estimated from the AUC and prevalence K using the method of Wray et al. (2010). The column denoted N is the number of AUC values estimated in cross-validation; each 3CV produces N = 3 AUC values.

RandomForest and Gradient Boosting Machines

We evaluated the case/control predictive performance of RandomForest (implemented in the R package randomForest, Liaw and Wiener, 2002) and Gradient Boosting Machines (GBM; R package gbm, Ridgeway, 1999, 2012). Due to the computational overhead of these methods in time and space, we could not run them on the entire SNP datasets. Instead, we ran them in 10 × 10-fold cross-validation on chr6 of the Celiac1 dataset (2200 samples, 19,169 SNPs). For RandomForest, we used mtry = 2 with 500 or 5000 trees. For GBM, we used logistic regression (Bernoulli loss) as the base learner, an interaction depth of 3, 500 or 5000 trees, and shrinkage of 0.001. For SparSNP, we used a lasso model with a grid of 50 penalties.

Overall, results for SparSNP were similar to those achieved by GBM and substantially better than RandomForest using 500 trees. GBM with 5000 trees had the best overall performance, with a small predictive advantage over SparSNP and a larger advantage over RandomForest with either 500 or 5000 trees (Table B.5).
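For concreteness, calls matching the settings described above might look as follows; X (a samples × SNPs matrix for chr6) and y (a 0/1 phenotype vector) are hypothetical placeholders, and this is a sketch rather than the exact scripts used:

library(randomForest)
library(gbm)

# RandomForest: mtry = 2, 500 (or 5000) trees, classification on a factor response.
rf <- randomForest(x = X, y = factor(y), mtry = 2, ntree = 500)

# GBM: Bernoulli (logistic) loss, interaction depth 3, 5000 trees, shrinkage 0.001.
gb <- gbm.fit(x = as.data.frame(X), y = y, distribution = "bernoulli",
              n.trees = 5000, interaction.depth = 3, shrinkage = 0.001)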


The predictive advantage of GBM with 5000 trees is potentially due both to the large number of trees used and to the fact that GBM took into account 2-way and 3-way interactions, whereas SparSNP and RandomForest did not.

As randomForest and gbm are R packages, they are limited in the amount of SNP data that can be analysed, since all data must be loaded into RAM. For autoimmune diseases such as celiac disease, using a small number of SNPs from chr6 is sufficient, and such models can be readily fitted. However, for other diseases such as CAD or T2D, the signal appears to be spread more widely across the genome, and analysing all SNPs would be preferable, as done by SparSNP. Apart from memory requirements, both randomForest and gbm are considerably slower than SparSNP (using the settings above), taking several hours to complete an analysis of chr6 for the models with 500 trees, compared with several minutes for SparSNP, when each method was run in parallel on the same 10-core machine.

Method                      AUC     95% LCL   95% UCL
SparSNP                     0.889   0.887     0.891
GBM (500 trees)             0.883   0.872     0.894
GBM (5000 trees)            0.895   0.883     0.907
RandomForest (500 trees)    0.805   0.775     0.834
RandomForest (5000 trees)   0.823   0.800     0.844

Table B.5.: AUC estimated in 10×10-fold cross-validation on chr6 of the Celiac1 dataset (2200 samples, 19,169 SNPs).


Figure B.1.: APRC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data. [Plot panels labelled 100, 500, 2500, 5000, 10000, 20000, 50000, 100000; APRC versus number of SNPs with non-zero effects; legend: risk ratio 1.1–2.0. Plot data omitted.]


Figure B.2.: AUC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data. [Plot panels labelled 100, 500, 2500, 5000, 10000, 20000, 50000, 100000; AUC versus number of SNPs with non-zero effects; legend: risk ratio 1.1–2.0. Plot data omitted.]


Figure B.3.: The first 5 principal components (PCs) of (a) the original Celiac1 data and (b) the data after removing high-LD regions, thinning, and regression on previous SNPs. The strong structure in the top PCs is largely removed by accounting for LD. PCs 6–10 were only weakly predictive of the phenotype and are not shown for clarity. [Panels: (a) Original Celiac1 data, (b) Thinned Celiac1 data. Plot data omitted.]


Figure B.5.: 10-fold cross-validated AUC for prediction of case/control status from the top 10 principal components of the Celiac1 dataset, using lasso logistic regression with glmnet (Friedman et al., 2010), selecting an increasing number of principal components (right to left): (a) the original dataset and (b) after LD-pruning. [Plots of AUC versus log(Lambda) omitted.]


Figure B.6.: LOESS-smoothed AUC (with 95% pointwise confidence intervals about the mean) for lasso models of stringently-filtered (a) Celiac1 and Celiac2-UK and (b) WTCCC-T1D, both in 30 × 3-fold cross-validation. [Plots of AUC versus number of SNPs with non-zero effects omitted.]


Figure B.7.: LOESS-smoothed AUC for models in 20×3-fold cross-validation. [Panels: (a) lasso squared-hinge loss, (b) logistic regression; AUC versus number of SNPs with non-zero effects for the BD, CAD, Celiac1, Celiac2-UK, Crohn, HT, RA, T1D, and T2D datasets. Plot data omitted.]


Figure B.8.: Averaged PPV/NPV for models in 20×3-fold cross-validation. [Panels: (a) lasso squared-hinge loss, (b) logistic regression; PPV versus NPV for the BD, CAD, Celiac1, Celiac2-UK, Crohn, HT, RA, T1D, and T2D datasets. Plot data omitted.]


Figure B.9.: Summary plots of one fold of cross-validation prediction in the WTCCC-T1D data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Panels: specificity versus sensitivity, precision versus recall, PPV versus NPV, and PPV by rank. Plot data omitted.]


Figure B.10.: Summary plots of one fold of cross-validation prediction in the Celiac1 data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Plot panels omitted; same layout as Figure B.9.]


Figure B.11.: Summary plots of one fold of cross-validation prediction in the Celiac1 data after stringent filtering. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Plot panels omitted; same layout as Figure B.9.]


Figure B.12.: Summary plots of one fold of cross-validation prediction in the Celiac2-UK data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Plot panels omitted; same layout as Figure B.9.]


Figure B.13.: Summary plots of one fold of cross-validation prediction in the Celiac2-UK data after stringent filtering. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Plot panels omitted; same layout as Figure B.9.]


Figure B.14.: LOESS-smoothed proportion of explained phenotypic variance, over 20×3-fold cross-validation. [Panels: (a) lasso squared-hinge loss, (b) logistic regression; explained variance versus number of SNPs with non-zero effects for the BD, CAD, Celiac1, Celiac2-UK, Crohn, HT, RA, T1D, and T2D datasets. Plot data omitted.]


Figure B.15.: LOESS-smoothed AUC for the lasso squared-hinge loss classifier and logistic regression on random subsamples of the T1D data. For each prespecified size N ∈ {50, 100, 200, 400, 800, 1600, 3200}, we randomly sampled the original 4901 samples (without replacement) to form a smaller dataset. The subsampling was repeated 30 times for N = 50, 20 times for N = 100, and 10 times for the rest. Within each subsampled dataset, we ran 10 × 3CV to evaluate the AUC (for example, 30 × 10 × 3CV for N = 50). For N = 4901, we used the original dataset without sampling, running 20 × 3CV. [Plot panels by subsample size; AUC versus number of SNPs with non-zero effects, for lasso and logistic regression. Plot data omitted.]


B.2.5. Principal Component Analysis of Cases

Figure B.16.: Principal Component Analysis (PCA) of the cases only, using the top 100 SNPs identified by the lasso for the Celiac1 and Celiac2-UK datasets and their stringently-filtered versions. Samples are colored by median specificity in the cross-validation replications: median specificity ≥ 0.99 (red), and the rest (black). [Panels: (a) Celiac1, (b) Celiac1 stringent filtering, (c) Celiac2-UK, (d) Celiac2-UK stringent filtering. Plot data omitted.]


Figure B.17.: Principal Component Analysis (PCA) of the cases only, using the top 100 SNPs identified by the lasso for T1D. Samples are colored by median specificity in the cross-validation replications: median specificity ≥ 0.99 (red), and the rest (black). [Plot data omitted.]


B.3. Results for each dataset

For each disease, we show results over 20 × 3-fold cross-validation:

• AUC versus the number of SNPs with non-zero effects;
• PPV versus NPV, using the population prevalence specified for each disease;
• Explained phenotypic variance versus the number of SNPs with non-zero effects, based on the prevalence specified for each disease;
• Top SNPs selected to be in the model. The SNPs were ranked by the proportion of times they are non-zero in a model, weighted by the inverse size of the model, over all cross-validation replications (a small sketch of this ranking is given after this list).

For several datasets, we also evaluated two subsets of the data: "MHC", which is all SNPs in the major histocompatibility complex (MHC) region on chr6 (29.7Mb–33.3Mb), and "−MHC", which is all autosomal SNPs outside the MHC.
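One way to read this ranking rule as code (a hypothetical logical matrix selected, with one row per cross-validation replication and one column per SNP, TRUE where the SNP has a non-zero weight in that replication's model; assumes every replication selects at least one SNP):

model.sizes <- rowSums(selected)             # number of non-zero SNPs in each replication's model
score <- colMeans(selected / model.sizes)    # each selection contributes 1 / (model size), averaged over replications
ranking <- order(score, decreasing = TRUE)   # SNPs ranked by weighted selection proportion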


B.3.1. Bipolar Disorder (BD)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1868    2938       459,012∗   Affy SNP 5.0   No           1%

Table B.6.: BD dataset, autosomes only. Prevalence from Bebbington and Ramana (1995); Wray et al. (2010).

Figure B.18.: AUC, PPV/NPV, and explained phenotypic variance for Bipolar Disorder. [Panels: (a) AUC; (b) PPV/NPV, based on models with 1000–2000 non-zero SNPs for lasso and 200–300 for logistic regression; (c) explained phenotypic variance; (d) top 30 SNPs ranked by weighted selection proportion. Plot data omitted.]


B.3.2. Coronary Artery Disease (CAD)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1926    2938       459,012∗   Affy SNP 5.0   No           5.6%

Table B.7.: CAD dataset, autosomes only. Prevalence from Wray et al. (2010).

Figure B.19.: AUC, PPV/NPV, and explained phenotypic variance for Coronary Artery Disease. [Panels: (a) AUC; (b) PPV/NPV, based on models with 1000–2000 non-zero SNPs for lasso and 200–300 for logistic regression; (c) explained phenotypic variance; (d) top 30 SNPs ranked by weighted selection proportion. Plot data omitted.]


B.3.3. Celiac Disease (Celiac)

    Cases   Controls   SNPs       Platform               Autoimmune   Prevalence
1   778     1422       301,689∗   Illumina HumanHap300   Yes          1%
2   1849    4936       516,504∗   Illumina HumanHap550   Yes          1%

Table B.8.: Celiac datasets, autosomes only. Prevalence from van Heel and West (2006).

Figure B.20.: AUC, PPV/NPV, and explained phenotypic variance for Celiac1. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, using models with 50–100 non-zero SNPs; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


Figure B.21.: AUC, PPV/NPV, and explained phenotypic variance for Celiac2-UK. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, using models with 50–100 non-zero SNPs for lasso and 200–300 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.4. Crohn's Disease/Inflammatory Bowel Disease (Crohn's)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1748    2938       459,012∗   Affy SNP 5.0   Yes          0.1%

Table B.9.: Crohn's dataset, autosomes only. Prevalence from Carter et al. (2004); Wray et al. (2010).

Figure B.22.: AUC, PPV/NPV, and explained phenotypic variance for Crohn's. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, based on models with 500–1000 non-zero SNPs for lasso and 150–200 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.5. Hypertension (HT)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1952    2938       459,012∗   Affy SNP 5.0   No           13.1%

Table B.10.: HT dataset, autosomes only. Prevalence from NHS (2010).

Figure B.23.: AUC, PPV/NPV, and explained phenotypic variance for Hypertension. (a) AUC versus the number of SNPs with non-zero effects (All subset, lasso and logistic regression); (b) PPV/NPV, based on models with 1000–2000 non-zero SNPs for lasso and 200–300 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.6. Rheumatoid Arthritis (RA)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1860    2938       459,012∗   Affy SNP 5.0   Yes          0.75%

Table B.11.: RA dataset, autosomes only. Prevalence from Wray et al. (2010).

Figure B.24.: AUC, PPV/NPV, and explained phenotypic variance for Rheumatoid Arthritis. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, based on models with 200–300 non-zero SNPs for lasso and 50–100 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.7. Type 1 Diabetes (WTCCC-T1D)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1963    2938       459,012∗   Affy SNP 5.0   Yes          0.54%

Table B.12.: T1D dataset, autosomes only. Prevalence from Wray et al. (2010).

Figure B.25.: AUC, PPV/NPV, and explained phenotypic variance for Type 1 Diabetes. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, based on models with 200–300 non-zero SNPs for lasso and 100–150 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.8. Type 2 Diabetes (T2D)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1924    2938       459,012∗   Affy SNP 5.0   No           3%

Table B.13.: T2D dataset, autosomes only. Prevalence from Wray et al. (2010).

Figure B.26.: AUC, PPV/NPV, and explained phenotypic variance for Type 2 Diabetes. (a) AUC versus the number of SNPs with non-zero effects (All subset, lasso and logistic regression); (b) PPV/NPV, based on models with 1000–2000 non-zero SNPs for lasso and 250–300 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


C. Supplementary Results for FMPR

Figure C.1.: Time to run fmpr over 50 independent replications, comparing FMPR against SPG (time in seconds). (a) Increasing samples N (100–5000), with p = 400, K = 10. (b) Increasing variables p (100–2000), with N = 100, K = 10. (c) Increasing tasks K (2–200), with N = 100, p = 100.
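As a rough illustration of how a scaling experiment of this kind could be structured, the sketch below times a fitting routine over repeated simulated datasets while N grows, with p and K held fixed, in the spirit of panel (a). It is a hedged sketch only: fit_fmpr is a stand-in ridge-style least-squares fit, not the actual FMPR or SPG implementation, and the real benchmark's solvers, penalties, and convergence settings are not reproduced here.

## Minimal R sketch (assumptions, not the thesis benchmark): average elapsed time
## of a fitting routine over `reps` replications, for increasing sample size N.

time_over_n <- function(fit_fun, n_values, p = 400, K = 10, reps = 50) {
  sapply(n_values, function(N) {
    times <- replicate(reps, {
      X <- matrix(rnorm(N * p), N, p)      # simulated predictors (e.g. genotypes)
      Y <- matrix(rnorm(N * K), N, K)      # simulated multi-task phenotypes
      system.time(fit_fun(X, Y))["elapsed"]
    })
    mean(times)                            # mean elapsed seconds at this N
  })
}

## Placeholder fitting function, used only to make the sketch runnable: an
## independent ridge-like least-squares fit for each of the K tasks.
fit_fmpr <- function(X, Y) {
  solve(crossprod(X) + diag(0.1, ncol(X))) %*% crossprod(X, Y)
}

## Small example run (reduced p and reps so it finishes quickly)
time_over_n(fit_fmpr, n_values = c(100, 250, 500), p = 50, K = 10, reps = 2)

Averaging the elapsed times per N mirrors how the curves in Figure C.1 summarise the 50 replications; the same harness could be looped over p or K for panels (b) and (c).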


Bibliography

1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467:1061–1073, 2010.

M. Ackermann and K. Strimmer. A general modular framework for gene set enrichment analysis. BMC Bioinfo., 10:47, 2009.

Affymetrix, Inc. BRLMM: an Improved Genotype Calling Method for the GeneChip Human Mapping 500K Array Set. Technical report, 2006.

A. Agresti. Categorical Data Analysis. Wiley, 2nd edition, 2002.

M. Ala-Korpela. Critical evaluation of 1H NMR metabonomics of serum as a methodology for disease risk assessment and diagnostics. Clinical Chemistry and Laboratory Medicine CCLM/FESCC, 46(1):27–42, 2008.

A. Albergaria, J. Paredes, B. Sousa, F. Milanezi, V. Carneiro, J. Bastos, S. Costa, D. Vieira, N. Lopes, E. W. Lam, N. Lunet, and F. Schmitt. Expression of FOXA1 and GATA-3 in breast cancer: the prognostic significance in hormone receptor-negative tumours. Breast Cancer Research, 11:R40, 2009.

U. Alon. An Introduction to Systems Biology. Chapman & Hall/CRC, 2007.

D. G. Altman and J. M. Bland. Diagnostic tests 2: predictive values. BMJ, 309:102, 1994.

C. Ambroise and G. J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci., 99:6562–6566, 2002.

M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene Ontology: tool for the unification of biology. Nat. Genet., 25:25–29, 2000.

J. E. Aten, T. F. Fuller, A. J. Lusis, and S. Horvath. Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Systems Biology, 2(1):34, 2008.


K. L. Ayers and H. J. Cordell. SNP Selection in Genome-Wide and Candidate Gene Studies via Penalized Logistic Regression. Genet. Epidemiol., 34:879–891, 2010.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with Sparsity-Inducing Penalties. Technical report, INRIA, 2011.

D. Baek, J. Villén, C. Shin, F. D. Camargo, S. P. Gygi, and D. P. Bartel. The impact of microRNAs on protein output. Nature, 455:64–71, 2008.

M. Bahlo, J. Stankovich, P. Danoy, P. F. Hickey, B. V. Taylor, S. R. Browning, M. A. Brown, and J. P. Rubio. Saliva-derived DNA performs well in large-scale, high-density single-nucleotide polymorphism microarray studies. Cancer Epidemiology, Biomarkers & Prevention, 19:794–8, 2010.

E. Bair and R. Tibshirani. Semi-Supervised Methods to Predict Patient Survival from Gene Expression Data. PLoS Biology, 2:0511–0522, 2004.

J. C. Barrett, S. Hansoul, D. L. Nicolae, J. H. Cho, R. H. Duerr, J. D. Rioux, S. R. Brant, M. S. Silverberg, K. D. Taylor, et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat. Genet., 40:955–962, 2008.

W. T. Barry, A. B. Nobel, and F. A. Wright. A statistical framework for testing functional categories in microarray data. Ann. Appl. Stat., 2:286–315, 2008.

D. P. Bartel. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell, 281–297, 2004.

P. Bebbington and R. Ramana. The epidemiology of bipolar affective disorder. Soc. Psychiatry Psychiatr. Epidemiol., 30:279–292, 1995.

J. Bedo, C. Sanderson, and A. Kowalczyk. An Efficient Alternative to SVM Based Recursive Feature Elimination with Applications in Natural Language Processing and Bioinformatics. In A. Sattar and B. H. Kang, editors, Proc. Aust. Joint Conf. AI, 2006.


Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J. R. Statist. Soc., 57:289–300, 1995.

A. H. Bild, G. Yao, J. T. Chang, Q. Wang, A. Potti, D. Chasse, M.-B. Joshi, D. Harpole, J. M. Lancaster, A. Berchuck, J. A. Olson, J. R. Marks, H. K. Dressman, M. West, and J. R. Nevins. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 439(7074):353–357, 2006.

H. Binder and M. Schumacher. Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples. Statist. Appl. Genet. Mol. Biol., 7, 2008.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3:993–1022, 2003.

B. J. Blencowe. Alternative Splicing: New Insights from Global Analyses. Cell, 126:37–47, 2006.

T. H. Bø, B. Dysvik, and I. Jonassen. LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acid Res., 32:e34, 2004.

W. Bodmer and C. Bonilla. Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet., 40:695–701, 2008.

B. M. Bolstad, R. A. Irizarry, M. Åstrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19:185–193, 2003.

J.-P. Bonnefont, F. Djouadi, C. Prip-Buus, S. Gobin, A. Munnich, and J. Bastin. Carnitine palmitoyltransferases 1 and 2: biochemical, molecular and medical aspects. Molecular Aspects of Medicine, 25:495–520, 2004.

L. Bottou and Y. LeCun. Large scale online learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel Coordinate Descent for L1-Regularized Loss Minimization. In ICML 2011, Proceedings of the 28th International Conference on Machine Learning, 2011.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.


H. Brentani, O. L. Caballero, A. A. Camargo, A. M. da Silva, W. A. da Silva, E. D. Neto, M. Grivet, A. Gruber, P. E. M. Guimaraes, W. Hide, C. Iseli, C. V. Jongeneel, J. Kelso, M. A. Nagai, E. P. B. Ojopi, et al. The generation and utilization of a cancer-oriented representation of the human transcriptome by using expressed sequence tags. Proc. Natl. Acad. Sci., 100:13148–13423, 2003.

S. R. Browning and B. L. Browning. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet., 81:1084–1097, 2007.

S. R. Browning and B. L. Browning. Population structure can inflate SNP-based heritability estimates. American Journal of Human Genetics, 89:191–3; author reply 193–5, 2011.

M. Buyse, S. Loi, L. van 't Veer, G. Viale, M. Delorenzi, A. M. Glas, M. S. d'Assignies, J. Bergh, R. Lidereau, P. Ellis, A. Harris, J. Bogaerts, P. Therasse, A. Floore, M. Amakrane, F. Piette, E. Rutgers, C. Sotiriou, F. Cardoso, M. J. Piccart, and T. Consortium. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J. Natl. Cancer Inst., 98:1183–1192, 2006.

C. Carlson, M. Eberle, M. Rieder, and Q. Yi. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. American Journal of Human Genetics, 74:106–20, 2004.

M. J. Carter, A. J. Lobo, S. P. Travis, and IBD Section, British Society of Gastroenterology. Guidelines for the management of inflammatory bowel disease in adults. Gut, 53 (Suppl 5):V1–V16, 2004.

K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines. J. Mach. Learn. Res., 9:1369–1398, 2008.

A. Chase, T. Ernst, A. Fiebig, A. Collins, F. Grand, P. Erben, A. Reiter, S. Schreiber, and N. C. P. Cross. TFG, a target of chromosome translocations in lymphoma and soft tissue tumors, fuses to GPR128 in healthy individuals. Haematologica, 95:20–6, 2010.

Y. Chen, J. Zhu, P.-Y. Lum, X. Yang, S. Pinto, D. J. MacNeil, C. Zhang, J. Lamb, S. Edwards, S. K. Sieberts, A. Leonardseon, L. W. Castelli, S. Wang, M.-F. Champy, B. Zhang, V. Emilsson, S. Doss, A. Ghazalpour, S. Horvath, T. A. Drake, A. J. Lusis, and E. E. Schadt. Variations in DNA elucidate molecular networks that cause disease. Nature, 452:429–435, 2008.

X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing. A smoothing proximal gradient method for general structured sparse regression. Ann. Appl. Statist., 2012. To appear.


L. Chin, J. N. Andersen, and P. A. Futreal. Cancer genomics: from discovery science to personalized medicine. Nature Medicine, 17(3):297–303, 2011.

H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer metastasis. Mol. Sys. Biol., 3, 2007.

G. M. Clarke, C. A. Anderson, F. H. Pettersson, L. R. Cardon, A. P. Morris, and K. T. Zondervan. Basic statistical analysis in genetic case-control studies. Nature Protocols, 6(2):121–33, 2011.

D. G. Clayton. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet., 5:e1000540, 2009.

D. G. Clayton. Sex chromosomes and genetic association studies. Genome Med., 1:110, 2009.

J. Cohen, P. Cohen, S. G. West, and L. S. Aiken. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, 3rd edition, 2003.

W. Cookson, L. Liang, G. Abecasis, M. Moffatt, and M. Lathrop. Mapping complex disease traits with global gene expression. Nat. Rev. Genet., 10:184–194, 2009.

H. J. Cordell. Detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet., 10:392–404, 2009.

E. Cosgun, N. A. Limdi, and C. W. Duarte. High-dimensional pharmacogenetic prediction of a continuous trait using machine learning techniques with application to warfarin dose prediction in African Americans. Bioinformatics, 27(10):1384–9, 2011.

C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, pages 1–7, 2012.

A. R. Dabney and J. D. Storey. Optimality driven nearest centroid classification from genomic data. PLoS One, 2:e1002, 2007.


H. Dai, L. van 't Veer, J. Lamb, Y. D. He, M. Mao, B. M. Fine, R. Bernards, M. van de Vijver, P. Deutsch, A. Sachs, R. Stoughton, and S. Friend. A cell proliferation signature is a marker of extremely poor outcome in a subpopulation of breast cancer patients. Cancer Res., 65:4059–4066, 2005.

I. Daubechies, M. Defrise, and C. D. Mol. An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint. Comm. Pure Appl. Math., LVII:1413–1457, 2004.

C. Desmedt, F. Piette, S. Loi, Y. Wang, F. Lallemand, B. Haibe-Kains, G. Viale, M. Delorenzi, Y. Zhang, M. S. d'Assignies, J. Bergh, R. Lidereau, P. Ellis, A. L. Harris, J. G. Klijn, J. A. Foekens, F. Cardoso, M. J. Piccart, M. Buyse, C. Sotiriou, and T. Consortium. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin. Cancer Res., 13:3207–3214, 2007.

C. Desmedt, B. Haibe-Kains, P. Wirapati, M. Buyse, D. Larsimont, G. Bontempi, M. Delorenzi, M. Piccart, and C. Sotiriou. Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clin. Cancer Res., 14:5158–5165, 2008.

B. Devlin and K. Roeder. Genomic Control for Association Studies. Biometrics, 55:997–1004, 1999.

S. Doss, E. E. Schadt, T. A. Drake, and A. J. Lusis. Cis-acting expression quantitative trait loci in mice. Genome Res., 15:681–691, 2005.

J. Downward. Targeting ras signalling pathways in cancer therapy. Nat. Rev. Cancer, 3:11–22, 2003.

P. C. A. Dubois, G. Trynka, L. Franke, K. A. Hunt, J. Romanos, A. Curtotti, A. Zhernakova, G. A. R. Heap, R. Ádány, et al. Multiple common variants for celiac disease influencing immune gene expression. Nat. Genet., 42:295–304, 2010.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML 2008, Proceedings of the 25th International Conference on Machine Learning, 2008.

J. Dupuis, C. Langenberg, I. Prokopenko, R. Saxena, N. Soranzo, A. U. Jackson, E. Wheeler, N. L. Glazer, N. Bouatia-Naji, A. L. Gloyn, et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nature Genetics, 42(2):105–16, 2010.

A. B. Dydensborg, A. A. N. Rose, B. J. Wilson, D. Grote, M. Paquet, V. Giguère, P. M. Siegel, and M. Bouchard. GATA3 inhibits breast cancer growth and pulmonary breast cancer metastasis. Oncogene, 28:2634–42, 2009.

F. Eckhardt, J. Lewin, R. Cortese, V. K. Rakyan, J. Attwood, M. Burger, J. Burton, T. V. Cox, R. Davies, T. A. Down, C. Haefliger, R. Horton, K. Howe, D. Jackson, J. Kunde, C. Koenig, J. Liddle, D. Niblett, T. Otto, R. Pettett, S. Seemann, C. Thompson, T. West, J. Rogers, A. Olek, K. Berlin, and S. Beck. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat. Genet., 38:1378–1385, 2006.

R. Edgar, M. Domrachev, and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acid Res., 30:207–210, 2002.

B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.

B. Efron and R. Tibshirani. On testing the significance of sets of genes. Annal. Stat., 1:107–129, 2007.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least Angle Regression. Ann. Stat., 32:407–499, 2004.

E. E. Eichler, J. Flint, G. Gibson, A. Kong, S. M. Leal, J. H. Moore, and J. H. Nadeau. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet., 11:446–450, 2010.

L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21:171–178, 2005.

L. Ein-Dor, O. Zuk, and E. Domany. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci., 103:5923–5928, 2006.

H. Eleftherohorinou, V. Wright, C. Hoggart, A. L. Hartikainen, et al. Pathway analysis of GWAS provides new insights into genetic susceptibility to 3 inflammatory diseases. PLoS ONE, 4:e8068, 2009.

V. Emilsson, G. Thorleifsson, B. Zhang, A. S. Leonardson, F. Zink, J. Zhu, S. Carlson, A. Helgason, G. B. Walters, S. Gunnarsdottir, M. Mouy, V. Steinthorsdottir, G. H. Eiriksdottir, G. Bjornsdottir, I. Reynisdottir, D. Gudbjartsson, A. Helgadottir, A. Jonasdottir,


Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28:337–407, 2000.

J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Statist., 1:302–332, 2007.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Soft., 33, 2010.

J. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29:1189–1232, 2001.

W. J. Fu. Penalized regressions: the bridge versus the lasso. J. Comput. Graph. Stat., 7:397–416, 1998.

A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.

G. Gibson. Rare and common variants: twenty arguments. Nat. Rev. Genet., 13(2):135–45, 2011.

Y. Gilad, S. A. Rifkin, and J. K. Pritchard. Revealing the architecture of gene regulation: the promise of eQTL studies. Trends. Genet., 24:408–415, 2008.

J. J. Goeman and P. Bühlmann. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics, 23:980–987, 2007.

J. Goeman. penalized: L1 (lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model, 2008. R package version 0.9-22.

A. D. Goldberg, C. D. Allis, and E. Bernstein. Epigenetics: A Landscape Takes Shape. Cell, 128:635–638, 2007.

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286:531–537, 1999.

H. H. H. Göring, J. E. Curran, M. P. Johnson, T. D. Dyer, J. Charlesworth, S. A. Cole, J. B. M. Jowett, L. J. Abraham, D. L. Rainwater, A. G. Comuzzie, M. C. Mahaney, L. Almasy, J. W. MacCluer, A. H. Kissebah, G. R. Collier, E. K. Moses, and J. Blangero. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat. Genet., 39:1208–1216, 2007.


V. Guerrero. Time-series analysis supported by power transformations. Journal of Forecasting, 12(1):37–48, 1993.

Z. Guo, T. Zhang, X. Li, Q. Wang, J. Xu, H. Yu, J. Zhu, H. Wang, C. Wang, E. J. Topol, Q. Wang, and S. Rao. Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinfo., 6:article 58, 2005.

I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res., 3:1157–1182, 2003.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Mach. Learn., 46:389–422, 2002.

B. Haibe-Kains, C. Desmedt, C. Sotiriou, and G. Bontempi. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics, 24:2200–2208, 2008.

J. A. Hanley and B. J. McNeil. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, 143:29–36, 1982.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2nd edition, 2009.

T. Hastie, R. Tibshirani, B. Narasimhan, and G. Chu. pamr: Pam: prediction analysis for microarrays, 2009. R package version 1.42.0.

P. W. Hedrick. Genetics of Populations. Jones & Bartlett, 4th edition, 2009.

L. Hernández, M. Pinyol, S. Hernández, S. Beà, K. Pulford, A. Rosenwald, L. Lamant, B. Falini, G. Ott, D. Y. Mason, G. Delsol, and E. Campo. TRK-fused gene (TFG) is a new partner of ALK in anaplastic large cell lymphoma producing two structurally different TFG-ALK translocations. Blood, 94:3265–8, 1999.

W. G. Hill, M. E. Goddard, and P. M. Visscher. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genetics, 4(2):e1000008, 2008.

L. A. Hindorff, P. Sethupathy, H. A. Junkins, E. M. Ramos, J. P. Mehta, F. S. Collins, and T. A. Manolio. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci., 106:9362–9367, 2009.

D. A. Hinds, L. L. Stuve, G. B. Nilsen, E. Halperin, E. Eskin, D. G. Ballinger, K. A. Frazer, and D. R. Cox. Whole-genome patterns of common DNA variation in three human populations. Science, 307(5712):1072–9, 2005.


C. J. Hoggart, J. C. Whittaker, M. D. Iorio, and D. J. Balding. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet., 4:e1000130, 2008.

M. Hollander and D. A. Wolfe. Nonparametric Statistical Methods. Wiley-Interscience, 2nd edition, 1999.

E. Holmes, I. D. Wilson, and J. K. Nicholson. Metabolic Phenotyping in Health and Disease. Cell, 134:714–717, 2008.

B. N. Howie, P. Donnelly, and J. Marchini. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genet., 5:e1000529, 2009.

B. Howie, J. Marchini, and M. Stephens. Genotype imputation with thousands of genomes. G3: Genes, Genomes, Genetics, 1(6):457–70, 2011.

V. Iacobazzi, F. Invernizzi, S. Baratta, R. Pons, W. Chung, B. Garavaglia, C. Dionisi-Vici, A. Ribes, R. Parini, M. D. Huertas, S. Roldan, G. Lauria, F. Palmieri, and F. Taroni. Molecular and functional analysis of SLC25A20 mutations causing carnitine-acylcarnitine translocase deficiency. Human Mutation, 24:312–20, 2004.

M. Inouye, J. Kettunen, P. Soininen, S. Ripatti, L. S. Kumpula, E. Hämäläinen, P. Jousilahti, A. J. Kangas, S. Männistö, M. J. Savolainen, A. Jula, J. Leiviskä, A. Palotie, V. Salomaa, M. Perola, M. Ala-Korpela, and L. Peltonen. Metabonomic, transcriptomic, and genetic variation of a population cohort. Mol. Sys. Biol., 6:441, 2010.

M. Inouye, K. Silander, E. Hamalainen, V. Salomaa, K. Harald, et al. An Immune Response Network Associated with Blood Lipid Levels. PLoS Genet., 6:e1001113, 2010.

International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature, 467:52–58, 2010.

International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299–1320, 2005.

International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449:851–861, 2007.

R. A. Irizarry, B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs, and T. P. Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acid Res., 31:e15, 2003.

R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics, 4:249–264, 2003.


A. V. Ivshina, J. George, O. Senko, B. Mow, T. C. Putti, J. Smeds, T. Lindahl, Y. Pawitan, P. Hall, H. Nordgren, J. E. Wong, E. T. Liu, J. Bergh, V. A. Kuznetsov, and L. D. Miller. Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res., 66:10292–10301, 2006.

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with Overlap and Graph Lasso. In ICML 2009, Proceedings of the 26th International Conference on Machine Learning, 2009.

R. C. Jansen and J.-P. Nap. Genetical genomics: the added value from segregation. Trends Genet., 17:388–393, 2001.

T. Kam-Thong, B. Pütz, N. Karbalai, B. Müller-Myhsok, and K. Borgwardt. Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs. Bioinformatics, 27(13):i214–i221, 2011.

M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acid Res., 28:27–30, 2000.

M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe, and M. Hirakawa. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucl. Acid. Res., 38:D355–D360, 2010.

A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004.

S.-Y. Kim and Y.-S. Kim. A gene sets approach for identifying prognostic gene signatures for outcome prediction. BMC Genomics, 9, 2008.

S. Kim and E. P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet., 5:e1000587, 2009.

S. Kim and E. P. Xing. Exploiting Genome Structure in Association Analysis. J. Comput. Biol., 18:1–16, 2011.

H. Kim, G. H. Golub, and H. Park. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21:187–98, 2005.

P.-M. Kloetzel. The proteasome and MHC class I antigen processing. Biochimica et Biophysica Acta, 1695:225–233, 2004.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. Ann. Stat., 28:1356–1378, 2000.

C. Kooperberg, M. Leblanc, and V. Obenchain. Risk prediction using genome-wide association studies. Genet. Epidemiol., 34:643–652, 2010.


C. M. Lindgren, I. M. Heid, J. C. Randall, C. Lamina, V. Steinthorsdottir, L. Qi, E. K. Speliotes, G. Thorleifsson, C. J. Willer, et al. Genome-wide association scan meta-analysis identifies three loci influencing adiposity and fat distribution. PLoS Genet., 5:e1000508, 2009.

B. Liu, L. Liu, A. Tsykin, G. Goodall, J. Green, M. Zhu, C. Kim, and J. Li. Identifying functional miRNA-mRNA regulatory modules with correspondence latent dirichlet allocation. Bioinformatics, 26:3105–3111, 2010.

B. A. Logsdon, G. E. Hoffman, and J. G. Mezey. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinfo., 11:58, 2010.

K. E. Lohmueller, C. L. Pearce, M. Pike, E. S. Lander, and J. N. Hirschhorn. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat. Genet., 33:177–182, 2003.

S. Loi, B. Haibe-Kains, C. Desmedt, F. Lallemand, A. M. Tutt, C. Gillet, P. Ellis, A. Harris, J. Bergh, J. A. Foekens, J. G. M. Klijn, D. Larsimont, M. Buyse, G. Bontempi, M. Delorenzi, M. J. Piccart, and C. Sotiriou. Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J. Clin. Oncol., 25:1239–1246, 2007.


T. A. Mckinsey, K. Kuwahara, S. Bezprozvannaya, and E. N. Olson. Responsiveness to the Ankyrin-Repeat Proteins ANKRA2 and RFXANK. Molecular Biology of the Cell, 17(January):438–447, 2006.

G. J. McLachlan, K.-A. Do, and C. Ambroise. Analyzing Microarray Gene Expression Data. Wiley Interscience, 2004.

D. W. Mehlman, U. L. Sheperd, and D. A. Kelt. Bootstrapping Principal Components Analysis: A Comment. Ecology, 76:640–643, 1995.

L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. J. R. Statist. Soc. B, 70:53–71, 2008.

N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the lasso. Annal. Stat., 34:1436–1462, 2006.

N. Meinshausen and P. Bühlmann. Stability selection. J. Royal Soc. Stats. B, 72:417–473, 2010.

I. Mérida, A. Avila-Flores, and E. Merino. Diacylglycerol kinases: at the hub of cell signalling. The Biochemical Journal, 409(1):1–18, 2008.

S. Michiels, S. Koscielny, and C. Hill. Prediction of cancer outcome with microarrays: a multiple random validation study. The Lancet, 365:488–492, 2005.

C. Miranda, E. Roccato, G. Raho, S. Pagliardini, M. A. Pierotti, and A. Greco. The TFG Protein, Involved in Oncogenic Rearrangements, Interacts With TANK and NEMO, Two Proteins Involved in the NF-kB Pathway. Journal of Cellular Physiology, 160:154–160, 2006.

B. Modrek and C. Lee. A genomic view of alternative splicing. Nat. Genet., 30:13–19, 2002.

J. H. Moore and S. M. Williams. Epistasis and Its Implications for Personal Genetics. Am. J. Hum. Genet., 85:309–320, 2009.

J. H. Moore. A global view of epistasis. Nat. Genet., 37:13–14, 2005.

V. K. Mootha, C. M. Lindgren, K.-F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstråle, E. Laurila, N. Houstis, M. J. Daly, N. Patterson, J. P. Mesirov, T. R. Golub, P. Tamayo, B. Spiegelman, E. S. Lander, J. N. Hirschhorn, D. Altshuler, and L. C. Groop. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34:267–273, 2003.

M. Morley, C. M. Molony, T. M. Weber, J. L. Devlin, K. G. Ewens, R. S. Spielman, and V. G. Cheung. Genetic analysis of genome-wide variation in human gene expression. Nature, 430:743–747, 2004.


J. D. Mosley and R. A. Keri. Cell cycle correlated genes dictate the prognostic power of breast cancer gene lists. BMC Med. Genom., 1:11, 2008.

P. W. Mueller, J. J. Rogus, P. A. Cleary, Y. Zhao, A. M. Smiles, M. W. Steffes, et al. Genetics of Kidneys in Diabetes (GoKinD) study: a genetics collection available for identifying genetic susceptibility factors for diabetic nephropathy in type 1 diabetes. J. Am. Soc. Nephrol., 17:1782–1790, 2006.

S. Myers, L. Bottolo, C. Freeman, G. McVean, and P. Donnelly. A fine-scale map of recombination rates and hotspots across the human genome. Science, 310:321–324, 2005.

NHS. Clinical and Health Outcomes Knowledge Base Financial Year 2008–2009, 2010. Accessed March 3rd, 2011.

J. K. Nicholson and J. C. Lindon. Metabonomics. Nature, 455:1054–1056, 2008.

L. Nisticó, C. Fagnani, I. Coto, S. Percopo, R. Cotichini, M. G. Limongelli, F. Paparo, S. D'Alfonso, M. Giordano, C. Sferlazzas, G. Magazzú, P. Momigliano-Richiardi, L. Greco, and M. A. Stazi. Concordance, disease progression, and heritability of coeliac disease in Italian twins. Gut, 55:803–808, 2006.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.

J. O. Ogutu, H.-P. Piepho, and T. Schulz-Streeck. A comparison of random forests, boosting and support vector machines for genomic selection. BMC Proceedings, 5 Suppl 3(Suppl 3):S11, 2011.

K. L. Ong, B. M. Y. Cheung, Y. B. Man, C. P. Lau, and K. S. L. Lam. Prevalence, Awareness, Treatment, and Control of Hypertension Among United States Adults 1999–2004. Hypertension, 49:69–75, 2007.

J. F. Oram and R. M. Lawn. ABCA1: the gatekeeper for eliminating excess tissue cholesterol. Journal of Lipid Research, 42:1173–1179, 2001.

M. Osborne, B. Presnell, and B. Turlach. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20:389–404, 2000.

M. R. Osborne, B. Presnell, and B. Turlach. On the lasso and its dual. J. Comput. Graph. Stat., 9:319–337, 2000.

S. Pagant and E. Miller. Transforming ER exit: protein secretion meets oncogenesis. Nature Cell Biology, 13:525–6, 2011.

N. Patterson, A. L. Price, and D. Reich. Population Structure and Eigenanalysis. PLoS Genet., 2:e190, 2006.


L. Pusztai, C. Mazouni, K. Anderson, Y. Wu, and W. F. Symmans. Molecular classification of breast cancer: limitations and potential. The Oncologist, 11:868–877, 2006.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0.

N. Rabbee and T. P. Speed. A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics, 22:7–12, 2006.

S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C. H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. P. Mesirov, T. Poggio, W. Gerald, M. Loda, E. S. Lander, and T. R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci., 98:15149–15154, 2001.

J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, 2nd edition, 2006.

D. E. Reich and E. S. Lander. On the allelic spectrum of human disease. Trends Genet., 17:502–510, 2001.

G. Ridgeway. The state of boosting. Computing Science and Statistics, 31:172–181, 1999.

G. Ridgeway. gbm: Generalized Boosted Regression Models, 2012. R package version 1.6-3.2.

M. V. Rockman. Reverse engineering the genotype-phenotype map with natural genetic variation. Nature, 456:738–744, 2008.

U. Roshan, S. Chikkagoudar, Z. Wei, K. Wang, and H. Hakonarson. Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forest. Nucl. Acids Res., 39:e62, 2011.

R. Savage, Z. Ghahramani, J. Griffin, B. D. L. Cruz, and D. Wild. Discovering transcriptional modules by Bayesian data integration. Bioinformatics, 26:i158–i167, 2010.

E. E. Schadt, J. Lamb, X. Yang, J. Zhu, S. Edwards, D. Guhathakurta, S. K. Sieberts, S. Monks, M. Reitman, C. Zhang, P. Y. Lum, A. Leonardson, R. Thieringer, J. M. Metzger, L. Yang, J. Castle, H. Zhu, S. F. Kash, T. A. Drake, A. Sachs, and A. J. Lusis. An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet., 17:710–717, 2005.

R. Schilsky. Personalized medicine in oncology: the future is now. Nature Reviews Drug Discovery, 9:363–366, 2010.

M. Schmidt, D. Böhm, C. von Törne, E. Steiner, A. Puhl, H. Pilch, H.-A. Lehr, J. G. Hengstler, J. Kölbl, and M. Gehrmann. The Humoral Immune System Has a Key Prognostic Impact in Node-Negative Breast Cancer. Cancer Res., 68:5405–5413, 2008.


E. Schneider, M. Rolli-Derkinderen, M. Arock, and M. Dy. Trends in histamine research: new functions during immune responses and hematopoiesis. Trends in Immunology, 23(5):255–263, 2002.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

N. J. Schork, S. S. Murray, K. A. Frazer, and E. J. Topol. Common vs. rare allele hypotheses for complex disease. Curr. Opin. Genet. Dev., 19:212–219, 2009.

V. Scotet, M. P. Audrézet, M. Roussey, G. Rault, M. Blayau, M. D. Braekeleer, and C. Férec. Impact of public health strategies on the birth prevalence of cystic fibrosis in Brittany, France. Hum. Genet., 113:280–285, 2003.

E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet., 34:166–176, 2003.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In ICML 2009, Proceedings of the 26th International Conference on Machine Learning, volume 26, 2009.

J. Shendure. The beginning of the end for microarrays? Nat. Meth., 5:585–587, 2008.

N. Shimizu, S. Noda, K. Katayama, H. Ichikawa, H. Kodama, and H. Miyoshi. Identification of genes potentially involved in supporting hematopoietic stem cell activity of stromal cell line MC3T3-G2/PA6. International Journal of Hematology, 87:239–45, 2008.

T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer. ROCR: visualizing classifier performance in R. Bioinformatics, 21:3940–3941, 2005.

G. K. Smyth and T. Speed. Normalization of cDNA Microarray Data. Methods, 31:266–273, 2003.

G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statist. Appl. Genet. Mol. Biol., 3, 2004.

G. K. Smyth. Limma: linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit, and W. H. R. Irizarry, editors, Bioinformatics and Computational Biology Solutions using R and Bioconductor, pages 397–420. Springer, New York, 2005.

S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: accurate recognition of transcription starts in human. Bioinformatics, 22:e472–e480, 2006.

T. Sørlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, T. Thorsen, H. Quist, J. C. Matese, P. O. Brown, D. Botstein, P. E. Lønning, and A. L. Børresen-Dale. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci., 98:10869–10874, 2001.

T. Sørlie, R. Tibshirani, J. Parker, T. Hastie, J. S. Marron, A. Nobel, S. Deng, H. Johnsen, R. Pesich, S. Geisler, J. Demeter, C. M. Perou, P. E. Lønning, P. O. Brown, A.-L. Børresen-Dale, and D. Botstein. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci., 100:8418–8423, 2003.

C. Sotiriou and M. J. Piccart. Taking gene-expression profiling to the clinic: when will molecular signatures become relevant to patient care? Nat. Rev. Cancer, 7:545–553, 2007.

C. Sotiriou and L. Pusztai. Gene-expression signatures in breast cancer. N. Eng. J. Med., 360:790–800, 2009.

C. Sotiriou, P. Wirapati, S. Loi, A. Harris, S. Fox, J. Smeds, H. Nordgren, P. Farmer, V. Praz, B. Haibe-Kains, C. Desmedt, D. Larsimont, F. Cardoso, H. Peterse, D. Nuyten, M. Buyse, M. J. V. de Vijver, J. Bergh, M. Piccart, and M. Delorenzi. Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. J. Natl. Cancer Inst., 98:262–272, 2006.

S. Soumian, C. Albrecht, A. Davies, and R. Gibbs. ABCA1 and atherosclerosis. Vascular Medicine, 10(2):109–120, May 2005.

F. J. Staal, M. van der Burg, L. F. Wessels, B. H. Barendregt, M. R. Baert, C. M. van den Burg, C. van Huffel, A. W. Langerak, V. H. van der Velden, M. J. Reinders, and J. J. van Dongen. DNA microarrays for comparison of gene expression profiles between diagnosis and relapse in precursor-B acute lymphoblastic leukemia: choice of technique and purification influence the identification of potential diagnostic markers. Leukemia, 17:1324–1332, 2003.

C. Staiger, S. Cadot, R. Kooter, M. Dittrich, T. Mueller, G. W. Klau, and L. F. A. Wessels. A critical evaluation of network and pathway based classifiers for outcome prediction in breast cancer. arXiv:1110.3717v2, 2011.

J. D. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci., 100:9440–9445, 2003.

B. E. Stranger, M. S. Forrest, A. G. Clark, M. J. Minichiello, S. Deutsch, R. Lyle, S. Hunt, B. Kahl, S. E. Antonarakis, S. Tavaré, P. Deloukas, and E. T. Dermitzakis. Genome-wide associations of gene expression variation in humans. PLoS Genetics, 1:e78, 2005.

B. E. Stranger, A. C. Nica, M. S. Forrest, A. Dimas, C. P. Bird, C. Beazley, C. E. Ingle, M. Dunning, P. Flicek, D. Koller, S. Montgomery, S. Tavaré, P. Deloukas, and E. T.


BibliographyThe Wellcome Trust Case Control Consortium. Genome-<strong>wide</strong> association study <strong>of</strong> 14,000cases <strong>of</strong> seven common diseases and 3,000 shared controls. Nature, 447:661–678, 2007.L. Tian, S. A. Greenberg, S. W. Kong, J. Altschuler, I. S. Kohane, and P. J. Park. Discoveringstatistically significant pathways in expression pr<strong>of</strong>iling studies. Proc. Natl. Acad. Sci.,102:13544–13549, 2005.R. J. Tibshirani and B. Efron. Pre-validation and inference in microarrays. Statist. Appl.Genet. Mol. Biol., 1:1, 2002.R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Class Prediction by Nearest ShrunkenCentroids, with Applications to DNA Microarrays. Stat. Sci., 18:104–117, 2003.R. Tibshirani. Regression Shrinkage and Selection via the Lasso. J. R. Statist. Soc. B,58:267–288, 1996.P. Törönen, P. J. Ojala, P. Maartinen, and L. Holm. Robust extraction <strong>of</strong> functional signalsfrom gene set <strong>analysis</strong> using a generalized threshold free scoring function. BMC Bioinfo.,10:307, 2009.G. Trynka, K. A. Hunt, N. A. Bockett, J. Romanos, V. Mistry, A. Szperl, S. F. Bakker, M. T.Bardella, L. Bhaw-Rosun, et al. Dense genotyping identifies and localizes multiple commonand rare variant association signals in celiac disease. Nat. Genet., 2011. advance onlinepublication.P. Tseng. Convergence <strong>of</strong> a Block Coordinate Descent Method <strong>for</strong> Nondifferentiable Minimization.J. Opt. Theory Appl., 109:475–494, 2001.T. Tukiainen, J. Kettunen, A. J. Kangas, L.-P. Lyytikåinen, et al. Detailed metabolic andgenetic characterization reveals new associations <strong>for</strong> 30 known lipid loci. Hum. Mol. Genet.,2011. To appear.J. Y. Uriu-Adams and C. L. Keen. Copper, oxidative stress, and <strong>human</strong> health. Molecularaspects <strong>of</strong> medicine, 26:268–98, 2005.M. J. van de Vijver, Y. D. He, L. J. van ’t Veer, H. Dai, A. A. M. Hart, D. W. Voskuil, G. J.Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen,A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H.Friend, and R. Bernards. A gene-expression signature as a predictor <strong>of</strong> survival in breastcancer. New Engl. J. Med., 347:1999–2009, 2002.A. J. Van der Kooij. Prediction Accuracy and Stability <strong>of</strong> Regression with Optimal ScalingTrans<strong>for</strong>mations. PhD thesis, Faculty <strong>of</strong> Social and Behavioural Sciences, Leiden University,2007.278


BibliographyP. J. van Diest, E. van der Wall, and J. P. A. Baak. Prognostic value <strong>of</strong> proliferation ininvasive breast cancer: a review. J. Clin. Pathol., 57:675–681, 2004.D. A. van Heel and J. West. Recent advances in coeliac disease. Gut, 55:1037–1046, 2006.D. A. van Heel, L. Franke, K. A. Hunt, R. Gwilliam, et al. A <strong>genome</strong>-<strong>wide</strong> association study<strong>for</strong> celiac disease identifies risk variants in the region harboring il2 and il21. Nat. Genet.,39:827–829, 2007.L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L.Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven,C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression pr<strong>of</strong>iling predictedclinical outcome <strong>of</strong> breast cancer. Nature, 415:530–536, 2002.M. H. van Vliet, C. N. Klijn, L. F. A. Wessels, and M. J. T. Reinders. Module-Based OutcomePrediction Using Breast Cancer Compendia. PLoS ONE, 2, 2007.D. Venet, J. E. Dumont, and V. Detours. Most Random Gene Expression Signatures AreSignificantly Associated with Breast Cancer Outcome. PLoS Comp. Biol., 7:e1002240,2011.J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith,M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H.Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian,P. D. Thomas, J. Zhang, G. L. G. Miklos, C. Nelson, S. Broder, A. G. Clark,J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon, C. Slayman,M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea,A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington,J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran,R. Charlab, K. Chaturvedi, Z. Deng, V. D. Francesco, P. Dunn, K. Eilbeck,C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J.Heiman, M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang,X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K. Naik, V. A. Narayan,B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Wang,A. Wang, X. Wang, J. Wang, M. Wei, R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye, M. Zhan,W. Zhang, H. Zhang, Q. Zhao, L. Zheng, F. Zhong, W. Zhong, S. Zhu, S. Zhao, D. Gilbert,S. Baumhueter, G. Spier, C. Carter, A. Cravchik, T. Woodage, F. Ali, H. An, A. Awe,D. Baldwin, H. Baden, M. Barnstead, I. Barrow, K. Beeson, D. Busam, A. Carver, A. Center,M. L. Cheng, L. Curry, S. Danaher, L. Davenport, R. Desilets, S. Dietz, K. Dodson,L. Doup, S. Ferriera, N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C. Haynes, C. Heiner,S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam, J. Johnson, F. Kalush, L. Kline,279


BibliographyS. Koduru, A. Love, F. Mann, D. May, S. McCawley, T. McIntosh, I. McMullen, M. Moy,L. Moy, B. Murphy, K. Nelson, C. Pfannkoch, E. Pratts, V. Puri, H. Qureshi, M. Reardon,R. Rodriguez, Y. H. Rogers, D. Romblad, B. Ruhfel, R. Scott, C. Sitter, M. Smallwood,E. Stewart, R. Strong, E. Suh, R. Thomas, N. N. Tint, S. Tse, C. Vech, G. Wang, J. Wetter,S. Williams, M. Williams, S. Windsor, E. Winn-Deen, K. Wolfe, J. Zaveri, K. Zaveri,J. F. Abril, R. Guig, M. J. Campbell, K. V. Sjolander, B. Karlak, A. Kejariwal, H. Mi,B. Lazareva, T. Hatton, A. Narechania, K. Diemer, A. Muruganujan, N. Guo, S. Sato,V. Bafna, S. Istrail, R. Lippert, R. Schwartz, B. Walenz, S. Yooseph, D. Allen, A. Basu,J. Baxendale, L. Blick, M. Caminha, J. Carnes-Stine, P. Caulk, Y. H. Chiang, M. Coyne,C. Dahlke, A. Mays, M. Dombroski, M. Donnelly, D. Ely, S. Esparham, C. Fosler, H. Gire,S. Glanowski, K. Glasser, A. Glodek, M. Gorokhov, K. Graham, B. Gropman, M. Harris,J. Heil, S. Henderson, J. Hoover, D. Jennings, C. Jordan, J. Jordan, J. Kasha, L. Kagan,C. Kraft, A. Levitsky, M. Lewis, X. Liu, J. Lopez, D. Ma, W. Majoros, J. McDaniel,S. Murphy, M. Newman, T. Nguyen, N. Nguyen, M. Nodell, S. Pan, J. Peck, M. Peterson,W. Rowe, R. Sanders, J. Scott, M. Simpson, T. Smith, A. Sprague, T. Stockwell,R. Turner, E. Venter, M. Wang, M. Wen, D. Wu, M. Wu, A. Xia, A. Zandieh, and X. Zhu.The sequence <strong>of</strong> the <strong>human</strong> <strong>genome</strong>. Science, 291:1304–1351, 2001.J.-B. Veyrieras, S. Kudaravalli, S.-Y. Kim, E. T. Dermitzakis, Y. Gilad, M. Stephens, andJ. K. Pritchard. High-Resolution Mapping <strong>of</strong> Expression-QTLs Yields Insight into HumanGene Regulation. PLoS Genet., 4:e1000214, 2008.P. M. Visscher, W. G. Hill, and N. R. Wray. Heritability in the genomics era–concepts andmisconceptions. Nat. Rev. Genet., 9:255–66, 2008.Y.-H. Wang and T. P. Speed. Design and <strong>analysis</strong> <strong>of</strong> comparative microarray experiments. InT. P. Speed, editor, Statistical Analysis <strong>of</strong> Gene Expression Microarray Data. CRC Press,2003.Y. Wang, J. G. Klijn, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov,M. Timmermans, M. M. van Gelder, J. Yu, T. Jatkoe, E. M. Berns, D. Atkins, and J. A.Foekens. Gene-expression pr<strong>of</strong>iles to predict distant metastasis <strong>of</strong> lymph-node-negativeprimary breast cancer. The Lancet, 365:671–679, 2005.Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool <strong>for</strong> transcriptomics.Nat. Rev. Genet., 10:57–63, 2009.T. J. Wang, M. G. Larson, R. S. Vasan, S. Cheng, E. P. Rhee, E. McCabe, G. D. Lewis, C. S.Fox, P. F. Jacques, C. Fernandez, C. J. O’Donnell, S. a. Carr, V. K. Mootha, J. C. Florez,A. Souza, O. Melander, C. B. Clish, and R. E. Gerszten. Metabolite pr<strong>of</strong>iles and the risk<strong>of</strong> developing diabetes. Nat. Med., 17:448–453, 2011.280


BibliographyZ. Wei and H. Li. Nonparametric pathway-based regression models <strong>for</strong> <strong>analysis</strong> <strong>of</strong> genomicdata. Biostatistics, 8(2):265–84, 2007.Z. Wei, K. Wang, H. Q. Qu, H. Zhang, J. Bradfield, C. Kim, E. Frackleton, et al. Fromdisease association to risk assessment: an optimistic view from <strong>genome</strong>-<strong>wide</strong> associationstudies on type 1 diabetes. PLoS Genet., 5:e1000678, 2009.S. Weidinger, C. Gieger, E. Rodriguez, H. Baurecht, M. Mempel, N. Klopp, H. Gohlke, S. Wagenpfeil,M. Ollert, J. Ring, H. Behrendt, J. Heinrich, N. Novak, T. Bieber, U. Krämer,D. Berdel, A. von Berg, C. P. Bauer, O. Herbarth, S. Koletzko, H. Prokisch, D. Mehta,T. Meitinger, M. Depner, E. von Mutius, L. Liang, M. M<strong>of</strong>fatt, W. Cookson, M. Kabesch,H.-E. Wichmann, and T. Illig. Genome-<strong>wide</strong> scan on total serum IgE levels identifiesFCER1A as novel susceptibility locus. PLoS genetics, 4(8):e1000166, 2008.B. Weigelt, J. L. Peterse, and L. J. van ’t Veer. Breast cancer metastasis: markers and models.Nat. Rev. Cancer, 5:591–602, 2005.Wellcome Trust Case Control Consortium. Genome-<strong>wide</strong> association study <strong>of</strong> CNVs in 16,000cases <strong>of</strong> eight common diseases and 3,000 shared controls. Nature, 464:713–720, 2010.J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use <strong>of</strong> the zero-norm with linearmodels and kernel methods. J. Mach. Learn. Res., 3:1439–1461, 2003.J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, 1990.P. Wirapati, C. Sotiriou, P. Farmer, S. Pradervand, B. Haibe-Kains, C. Desmedt, M. Ignatiadis,T. Sengstag, F. Schutz, D. Goldstein, M. Piccart, and M. Delorenzi. Meta-<strong>analysis</strong><strong>of</strong> gene expression pr<strong>of</strong>iles in breast cancer: toward a unified understanding <strong>of</strong> breast cancersubtyping and prognosis signatures. Breast Cancer Res., 10:R65, 2008.B. S. Wittner, D. C. Sgroi, P. D. Ryan, T. J. Bruinsma, A. M. Glas, A. Male, S. Dahiya,K. Habin, R. Bernards, D. A. Haber, L. J. V. Veer, and S. Ramaswamy. Analysis <strong>of</strong> themammaprint breast cancer assay in a predominantly postmenopausal cohort. Clin. CancerRes., 14:2988–2993, 2008.N. R. Wray, J. Yang, M. E. Goddard, and P. M. Visscher. The Genetic Interpretation <strong>of</strong> Areaunder the ROC Curve in Genomic Pr<strong>of</strong>iling. PLoS Genet., 6:e1000864, 2010.T.-T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange. Genome-<strong>wide</strong> association <strong>analysis</strong>by lasso penalized logistic regression. Bioin<strong>for</strong>matics, 25:714–721, 2009.X. Xie, J. Lu, E. Kulbokas, T. Golub, V. Mootha, K. Lindblad-Toh, E. Lander, and M. Kellis.Systematic discovery <strong>of</strong> regulatory motifs in <strong>human</strong> promoters and 3’ UTRs by comparison<strong>of</strong> several mammals. Nature, 434:338–345, 2005.281


BibliographyC. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistaticinteractions from large-scale SNP data via adaptive group Lasso. BMC Bioinfo., 11 (Suppl1):S18, 2010.J. Yang, S. H. Lee, M. E. Goddard, and P. M. Visscher. GCTA: A Tool <strong>for</strong> Genome-<strong>wide</strong>Complex Trait Analysis. Am. J. Hum. Genet., 88:76–82, 2011.M. Yousef, S. Jung, L. C. Showe, and M. K. Showe. Recursive cluster elimination (RCE) <strong>for</strong>classification and feature selection from gene expression data. BMC Bioinfo., 8:article 144,2007.H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large linear classification when datacannot fit in memory. In 16th ACM KDD, 2010.E. Zeggini, L. J. Scott, R. Saxena, B. F. Voight, and the DIAGRAM Consortium. Meta<strong>analysis</strong><strong>of</strong> <strong>genome</strong>-<strong>wide</strong> association data and large-scale replication identifies susceptibilityloci <strong>for</strong> type 2 diabetes. Nat. Genet., 40:638–645, 2008.P. Zhao and B. Yu. On model selection consistency <strong>of</strong> lasso. J. Mach. Learn. Res., 7:2541–2563, 2006.J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. J. Comput.Graph. Stat., 14:185–205, 2005.H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. R. Statist.Soc. B, 67:301–320, 2005.H. Zou. The adaptive lasso and its oracle properties. J. Amer. Stat. Assoc., 101:1418–1429,2006.282
