Scalable Approaches for Analysis of Human Genome-Wide Expression and Genetic Variation Data

Gad Abraham

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

March 2012

Department of Computing and Information Systems
The University of Melbourne

Produced on archival quality paper


Abstract

One of the major tasks in bioinformatics and computational biology is prediction of phenotype from molecular data. Predicting phenotypes such as disease opens the way to better diagnostic tools, potentially identifying disease earlier than would be detectable using other methods. Examining molecular signatures rather than clinical phenotypes may help refine disease classification and prediction procedures, since many diseases are known to have multiple molecular subtypes with differing etiology, prognosis, and treatment options. Beyond prediction itself, identifying predictive markers aids our understanding of the biological mechanisms underlying phenotypes such as disease, generating hypotheses that can be tested in the lab.

The aims of this thesis are to develop effective and efficient computational and statistical tools for analysing large-scale gene expression and genetic datasets, with an emphasis on predictive models. Several key challenges include the high dimensionality of the data, which has important statistical and computational implications; noisy data due to measurement error and the stochasticity of the underlying biology; and maintaining biological interpretability without sacrificing predictive performance. We begin by examining the problem of predicting breast cancer metastasis and relapse from gene expression data, and present an alternative approach based on gene set statistics. Second, we address the problem of analysing large human case/control genetic (single nucleotide polymorphism) data, and present an efficient and scalable algorithm for fitting sparse models to large datasets. Third, we apply sparse models to genetic case/control datasets from eight complex human diseases, evaluating how well each one can be predicted from genotype. Fourth, we apply sparse lasso methods to a multi-omic dataset consisting of genetic variation, gene expression, and serum metabolites, for reconstruction of genetic regulatory networks. Finally, we propose a novel multi-task statistical approach, intended for modelling multiple correlated phenotypes.

In summary, this thesis discusses a range of predictive models and applies them to a wide range of problems, including gene expression, genetic, and multi-omic datasets. We demonstrate that such models, and particularly sparse models, are computationally feasible and can scale to large datasets, provide increased insight into the biological causes of disease, and for some diseases have high predictive performance, allowing high-confidence disease diagnosis to be made based on genetic data.


Declaration

This is to certify that

1. the thesis comprises only my original work towards the PhD except where indicated in the Preface,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Signed


Preface

This thesis incorporates the following publications:

• Chapter 4 is substantially based on: G. Abraham, A. Kowalczyk, S. Loi, I. Haviv, and J. Zobel. Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context. BMC Bioinformatics, 11:277, 2010. (primary author: G. Abraham).

• Chapter 5 is substantially based on: G. Abraham, A. Kowalczyk, J. Zobel, and M. Inouye. SparSNP: Fast and memory-efficient analysis of all SNPs for phenotype prediction. BMC Bioinformatics, 13:88, 2012. (primary author: G. Abraham).

• Chapter 6 is substantially based on: G. Abraham, A. Kowalczyk, J. Zobel, and M. Inouye. Sparse linear models to explain phenotypic variance and predict complex disease. In "NIPS Personalized Medicine Workshop 2011", December 16th, 2011, Granada, Spain. (primary author: G. Abraham).

• Chapter 6 is partly based on: G. Abraham, A. Kowalczyk, J. Zobel, and M. Inouye. Sparse linear models to explain phenotypic variance and predict complex disease (expanded journal version). 2012. Under peer review. (primary author: G. Abraham).


Acknowledgment

Thanks are due to my supervisors, Professor Justin Zobel and Dr Adam Kowalczyk. Justin taught me how to think about scientific questions, encouraged me to dig deeper when interpreting my own work and that of others, and helped me communicate science better. Adam's deep technical knowledge, experience, and determination were invaluable lessons to me and inspired me in my work. Thanks also to Dr Michael Inouye. Mike's wide-ranging knowledge, drive, and generosity have helped shape both my work and my approach to science in general. Thanks to Dr Izhak Haviv, whose passion for science is obvious to all who know him, and who was always willing to help and provide advice, at any time of the day (or night). Thanks to the head of my PhD committee, Professor Peter Stuckey, for patiently guiding me in my PhD process towards better research and a better thesis.

Thanks also go to my fellow students: Raj Gaire, Fan Shi, Gerard Wong, Ben Goudey, Shanika Kuruppu, Geoff Macintyre, and Justin Bedo. They have provided me with outlets from the sometimes isolating PhD process, and made the process much more enjoyable.

Thanks to Matthias Reumann from IBM Research and David Bannon from the Victorian Life Sciences Computing Initiative (VLSCI) for supporting my work, and thanks to VLSCI (project VR0126) and the Victorian Partnership for Advanced Computing (VPAC) for providing high-performance computing facilities. Thanks to NICTA and the University of Melbourne for scholarship funding and travel grants. Thanks to David A. van Heel (Queen Mary, University of London) for generously supplying the celiac disease data used in this thesis.

Finally, thanks are due to my family, and especially to my wife Laura, and my children Ori and Abigail, for their endless patience and love that have allowed me to complete this thesis.

Funding Statement

This work was supported by the Australian Research Council, and by the NICTA Victorian Research Laboratory. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications, and the Digital Economy, and the Australian Research Council through the ICT Centre of Excellence program. This work was made possible through Victorian State Government Operational Infrastructure Support and Australian Government NHMRC IRIIS. Michael Inouye was supported by an NHMRC Biomedical Australian Training Fellowship (no. 637400). This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under awards 076113 and 085475. Funding support for the GAIN Search for Susceptibility Genes for Diabetic Nephropathy in Type 1 Diabetes (GoKinD study participants) study was provided by the Juvenile Diabetes Research Foundation (JDRF) and the Centers for Disease Control (CDC) (PL 105-33, 106-554, and 107-360, administered by the National Institute of Diabetes and Digestive and Kidney Diseases, NIDDK), and the genotyping of samples was provided through the Genetic Association Information Network (GAIN). The dataset(s) used for the analyses described in this manuscript were obtained from the database of Genotypes and Phenotypes (dbGaP) found at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000018.v1.p1. Samples and associated phenotype data for the Search for Susceptibility Genes for Diabetic Nephropathy in Type 1 Diabetes study were provided by James H. Warram, MD, Joslin Diabetes Center.


Contents

List of Abbreviations

1. Introduction

2. Biological Background
2.1. The Central Dogma of Molecular Biology
2.2. The Molecular Basis for Disease
2.3. Gene Expression
2.3.1. Measuring Gene Expression
2.3.2. Challenges in Analysis of Gene Expression Microarrays
2.4. The Genetic Basis of Disease
2.4.1. Linkage Disequilibrium
2.4.2. Hardy-Weinberg Equilibrium
2.4.3. SNP Microarray Technology
2.4.4. Imputation
2.4.5. Genome Wide Association Studies
2.4.6. Challenges in Analysis of SNP Microarrays
2.4.7. The Problem of Missing Heritability
2.4.8. Expression Quantitative-Trait Loci
2.5. Summary

3. Review of the Analysis of Gene Expression and Genetic Data
3.1. Introduction
3.2. Supervised Machine Learning
3.3. Linear Models and Loss Functions
3.4. Feature Selection — Finding Predictive & Causal Markers
3.4.1. Filter Methods
3.4.2. Wrapper Methods
3.4.3. Embedded Methods
3.4.4. Other Methods for Dimensionality Reduction
3.4.5. Other Methods for Classification and Regression


5.6. Software Features
5.7. Discussion

6. Sparse Linear Models Explain Phenotypic Variation and Predict Risk of Complex Disease
6.1. Introduction
6.2. Methods
6.2.1. Genetic Models
6.2.2. HAPGEN2 simulations
6.2.3. Positive and negative predictive values
6.2.4. Genomic Inflation Factor
6.2.5. Data and quality control
6.3. Results
6.3.1. Recovery of Causal SNPs in Simulation
6.3.2. Modelling genome-wide profiles for eight complex diseases
6.3.3. Assessment of confounding factors
6.3.4. Discrimination of the phenotype in cross-validation
6.3.5. Genetic models in a population context
6.3.6. Genetic substructure of celiac disease and type 1 diabetes
6.4. Discussion
6.5. Conclusions

7. Genetic Control of the Human Metabolic Gene Regulation
7.1. Introduction
7.2. Methods
7.2.1. Data
7.2.2. Predictive Modelling
7.2.3. Causal Network Inference
7.3. Results
7.3.1. Predictive Models of Metabolites using Gene Expression
7.3.2. Integrating the Metabolite Models with Models of Gene Expression based on SNPs
7.3.3. Linking the Causal Networks to Fasting Glucose Levels and Type 2 Diabetes
7.4. Discussion

8. Fused Multitask Penalised Regression
8.1. Introduction
8.2. Background


8.3. Methods
8.3.1. Fused Multitask Penalised Regression
8.3.2. Implementation
8.3.3. Computational Enhancements
8.4. Simulation
8.5. Results
8.5.1. Simulation
8.5.2. Experiments on the DILGOM Dataset
8.5.3. Time Complexity
8.6. Discussion

9. Conclusions

A. Supplementary Results for Gene Set Statistics
A.1. Classifiers
A.2. Internal Validation
A.3. External Validation

B. Supplementary Results for Sparse Linear Models
B.1. Scoring Measures for Causal SNP Detection
B.2. Real Data
B.2.1. Checking for Stratification
B.2.2. AUC for Stringent Filtering
B.2.3. PPV/NPV
B.2.4. Comparison with Other Methods
B.2.5. Principal Component Analysis of Cases
B.3. Results for each dataset
B.3.1. Bipolar Disorder (BD)
B.3.2. Coronary Artery Disease (CAD)
B.3.3. Celiac Disease (Celiac)
B.3.4. Crohn's Disease/Inflammatory Bowel Disease (Crohn's)
B.3.5. Hypertension (HT)
B.3.6. Rheumatoid Arthritis (RA)
B.3.7. Type 1 Diabetes (WTCCC-T1D)
B.3.8. Type 2 Diabetes (T2D)

C. Supplementary Results for FMPR

Bibliography


List of Figures

2.1. The Central Dogma of molecular biology.
2.2. An outline of the gene expression microarray experiment, for spotted cDNA (left) and oligonucleotide arrays (right). Reprinted by permission from Macmillan Publishers Ltd (Staal et al., 2003), copyright (2003).
2.3. A revised Central Dogma of molecular biology. We distinguish between a clinical phenotype, which is a high-level phenotype such as case/control status, and other phenotypes — many of the other nodes can be considered phenotypes in their own right, such as gene expression (mRNA) in eQTL studies.
2.4. Different phenotypes are characterised by different combinations of variant frequency and effect size. Reprinted by permission from Macmillan Publishers Ltd (Manolio et al., 2009), copyright (2009).
2.5. Association for SNPs in chromosome 13 (q22.1) with pancreatic cancer. The diamonds and squares represent the log10 p-values for association of SNPs. Overlaid are the recombination rates (centimorgan per megabase). On the bottom is a heatmap showing the LD between the SNPs, measured by r². Reprinted by permission from Macmillan Publishers Ltd (Petersen et al., 2010), copyright (2010).
2.6. Measured intensities for two alleles of one locus, over all samples in the 1958 birth cohort of the WTCCC data (The Wellcome Trust Case Control Consortium, 2007). The samples coloured red, green, and blue are called as BB, AB, and AA, respectively. The light blue colour represents missing calls (CHIAMO genotype calls made with posterior probability < 0.9). The left and right panels show the calls before and after imputing the missing calls, respectively. Reprinted by permission from Macmillan Publishers Ltd (Marchini et al., 2007), copyright (2007).
2.7. The family-wise Type 1 error rate for k independent tests. The per-test threshold is α = 0.05.


3.1. An illustration of the relationship between true and empirical risk as model complexity increases. The Bayes risk is shown as constant since it assumes a fixed model complexity, which is the "correct" (but unknown) model. Empirical risk is the risk observed for a given model in a given finite dataset. On the far left-hand side, the model can be said to be underfitting, as the empirical risk is higher than the true risk. On the right-hand side, the model is overfitting, as it has lower empirical risk than the true risk.
3.2. Four loss functions for classification: 0/1 loss L(z_i) = I(z_i < 1), logistic regression L(z_i) = log(1 + exp(−z_i)), hinge loss L(z_i) = max{0, 1 − z_i}, and squared hinge loss L(z_i) = max{0, 1 − z_i}², where z_i = y_i(β_0 + x_i^T β) for linear models. Informally, for z ≥ 1 the predicted and observed classes match, sign(ŷ_i) = sign(y_i) (correct classification), and for z < 1 they do not match (mis-classification).
3.3. A toy example of a three-gene network. If gene A is mutated and causes a downstream effect in genes B and C, then all three genes may appear to be associated with the phenotype, even though gene B is clearly non-causal.
3.4. A contingency table of two alleles versus the case-control status, in terms of (a) counts, (b) conditional probabilities Pr(y|x), and (c) the odds.
3.5. Penalised squared loss in 2 dimensions. The red contours show the curves of constant loss for different solution pairs (β_1, β_2), and β* is the unpenalised solution. Also shown are the feasible regions (in cyan) imposed by the ridge constraint β_1² + β_2² ≤ t (left), and by the lasso constraint |β_1| + |β_2| ≤ t (right). Adapted from Hastie et al. (2009a).
4.1. A heatmap showing differentially expressed genes (rows) over a subset of 250 samples (columns) from the five breast cancer datasets. Differential expression was determined using linear models in limma (Smyth, 2005). Samples are coloured red and blue for < 5 years and ≥ 5 years to metastasis, respectively. Under- and over-expressed genes are coloured red and green, respectively.
4.2. Schematic of how the gene set features are constructed from three gene sets S_1 (red), S_2 (green), and S_3 (blue), each with 2, 3, and 1 gene/s, respectively. Note that for clarity we show non-overlapping sets, although the sets can overlap in practice.
4.3. Average and 95% confidence intervals for AUC from external validation between the five datasets, n = 2 × (5 choose 2) = 20 (train, test) pairs, for different numbers of features. We show only every second confidence interval for clarity. Note that each dataset ranks its features independently; hence, the kth feature is not necessarily the same across datasets. Individual genes are denoted raw.


4.4. Variance and 95% confidence intervals of the AUC from external validation between the five datasets, n = 2 × (5 choose 2) = 20 (train, test) pairs, for different numbers of features. The confidence intervals are [(n − 1)s²/χ²_{α/2,n−1}, (n − 1)s²/χ²_{1−α/2,n−1}], where χ²_α is the α = 0.05 quantile for a chi-squared distribution with n − 1 degrees of freedom, and s² is the sample variance.
4.5. Mean and 2.5%/97.5% of the ranks of genes and gene sets. Ranks are based on the weight assigned by the centroid classifier to each feature. For gene sets, we used the set centroid statistic. The process was repeated over 5000 bootstrap replications of the GSE4922 dataset. Features have been sorted by their mean rank.
4.6. Spearman rank-correlation of the centroid classifier's weights from the five datasets (n = 10 comparisons). Individual genes are denoted raw.
4.7. Concordance of feature lists (genes or gene sets) for different cutoffs f = 1, . . . , 200, counting the number of features occurring in all of the five datasets' lists, ranked higher than f. We use raw to denote individual genes. Prior to ranking, we selected 4120 genes (for the raw lists) or gene sets (for the set statistics) to be ranked, so that the number of unique items was identical across all lists.
4.8. Kolmogorov-Smirnov enrichment for MSigDB categories, using the set-centroid statistic. (A) AUC and spline smooth for each set, tested on GSE11121. (B) Number of mapped probesets in each set, on log2 scale, and spline smooth. (C) Two-sample Kolmogorov-Smirnov Brownian-bridge for each MSigDB category (p-values: C1: 1.44×10⁻⁴, C2: 3.55×10⁻¹⁵, C3: < 2.22×10⁻¹⁶, C4: 4.22×10⁻¹³, C5: 2.38×10⁻²).
4.9. AUC and weight versus set size for the set centroid statistic, using the centroid classifier.
4.10. Expression of ESR1 (ER) versus ERBB2 (HER2) for the combined dataset. A mixture of three Gaussians is fitted to the data. Clusters 1, 2, and 3 represent the ER−/HER2−, ER+/HER2−, and HER2+ subtypes, respectively.
5.1. Time (in seconds) for model fitting, over sub-samples of the entire celiac disease dataset, taken as the minimum over 10 independent runs. (a) For all methods including hyperlasso. (b) Excluding hyperlasso. For in-memory methods we included the time to read the binary data into R. For SparSNP and glmnet we used a λ grid of size 20, and a maximum model size of 2048 SNPs. liblinear used C = 1. hyperlasso used one iteration with λ = 1 (DE prior). The left panel includes all four methods, the right panel excludes hyperlasso. The insets show the leftmost panel (50,000 SNPs) on its own scale to better visualise the differences.


5.2. Left: LOESS-smoothed AUC and explained phenotypic variance (denoted "VarExp") for the Finnish celiac disease dataset, for increasing model sizes. For liblinear-cdblock (LL-CD-L2), all 516,504 SNPs are included in the model. AUC is estimated over 30 × 3-fold cross-validation. The explained phenotypic variance is estimated from the AUC using the method of Wray et al. (2010), assuming a population prevalence of K = 1%.
5.3. An example pipeline for analysing a SNP discovery dataset with SparSNP and testing the model on a validation dataset. Most of the data preparation and processing can be done with PLINK.
6.1. APRC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data.
6.2. (a) Area under the receiver operating characteristic curve (AUC) for models of the 9 case/control datasets. Results are LOESS-smoothed over 20 × 3-fold cross-validation. See the Supplementary Results for details on each disease. (b) LOESS-smoothed proportion of phenotypic variance explained for the lasso models for the 9 discovery datasets, using the method of Wray et al. (2010).
6.3. Lasso models can achieve high positive predictive values. PPV versus NPV for the lasso models of the 9 discovery datasets. Results are averaged over 20 × 3CV. See the Supplementary Results for the number of SNPs with non-zero coefficients in each dataset. Note that the curves do not span the entire range of NPV since not all sensitivity and specificity values can be observed in a finite dataset.
6.4. Genetic subclasses of celiac disease cases exhibit high predictability. The PCs are obtained from PCA of the genotypes belonging to the SNPs identified by the lasso models with ∼100 SNPs in cross-validation, for (a) the original Celiac1 dataset and (b) a stringently-filtered version of the Celiac1 dataset. Samples with a median specificity ≥ 0.99 in prediction of cases are highlighted in red.
7.1. Schematic diagram of our analysis pipeline.
7.2. Decision tree for inferring the causal graph structure based on the pattern of marginal and partial correlations, assuming that cis-QTLs are causal to the gene.


7.3. R² for regressing the metabolites on all gene probes, together with all clinical variables (model 1), or after removing the effect of the clinical variables (model 2), showing the top 10 for each model. The results for all metabolites are shown in the insets. Metabolites were sorted in descending order of R². R² was estimated with nested cross-validation. Note the different scales.
7.4. The top 10 variables (metabolites+genes for model 1 and genes for model 2) selected as predictors of metabolite variation in models 1 and 2. The genes were ranked by the proportion of metabolites for which each gene was selected by the lasso regression (a, b) or the proportion times the R² for the corresponding metabolite (c, d), in order to upweight genes that are not only included as predictors of many metabolites but are also more highly predictive. Each inset shows all the variables for each model. Note the different scales.
7.5. Ratio of R² in model 2 to R² in model 1 of metabolites, sorted in decreasing order. Large figure: metabolites with ratios ≥ 0.5. Inset: all 98 metabolites with positive R² in model 1.
7.6. Hierarchical clustering of the metabolites, using complete linkage.
7.7. Box-and-whisker plots of R² in model 2 for each metabolite cluster, predicting metabolite concentrations from gene expression.
7.8. Box-and-whisker plots of cross-validated R² for the stable genes associated with each metabolite (predicted from the SNPs), compared with an aggregation of all metabolites ("All") and a random set of genes ("Random"). Also shown is the number of stable genes associated with each metabolite.
7.9. Inferred network of regulation for serum triglycerides. Inferred causal edges are shown as solid edges. Dashed edges represent trans-QTLs, where a direct causal effect on the gene cannot be inferred. The edge widths are proportional to the R² of the marginal association between the nodes from a univariable linear regression (in parentheses).
7.10. Metabolites selected as stably associated with fasting glucose, as selected by lasso regression, correcting for the effect of the clinical variables. The edge weights show the exponentiated weights exp(β), corresponding to increases in (a) fasting glucose and (b) odds ratio of fasting glucose ≥ 7 mmol/L, respectively, for a one standard deviation increase in each metabolite, averaged over the cross-validation replications.
7.11. Inferred causal networks for three metabolites stably associated with fasting glucose levels. The edge weights are the R² from a univariable linear regression of each child node on each parent node. The R² from a multivariable lasso linear regression on all inputs (SNPs for genes and genes for metabolites) is shown in parentheses next to each node.


7.12. Top 5 principal components from PCA of the genotype data.
8.1. An illustration of a hypothetical setup in which five genes G_1, ..., G_5 are associated with five metabolites M_1, ..., M_5. Several metabolites share the same gene associations (solid lines), and therefore are correlated with each other (correlation shown by dashed lines). By leveraging the inter-metabolite correlations, multi-task methods such as fmpr and GFlasso aim to better identify which inputs (genes in this case) are truly associated with which outputs (metabolites), while avoiding spurious associations due to effects such as noise, under the assumption that correlated outputs are caused by common regulators (pleiotropic genes).
8.2. The solution path of fmpr for one parameter β_j over K = 10 tasks, for increasing γ and with λ = 0.
8.3. An illustration of the three sparsity setups used in the multi-task simulations. Top row: absolute values of the p × K weight matrix B used for generating the outputs y, for models 1, 2, and 3, respectively (model 4 has identical weights and correlations in absolute value to model 1). Bottom row: the K × K correlation matrices of the outputs y.
8.4. Simulations with varying number of samples N (Setup 1), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.5. Simulations with varying levels of noise σ (Setup 2), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.6. Simulations with varying number of tasks K (Setup 3), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.7. Simulations with varying weights β (Setup 4), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.8. Simulations with varying number of parameters p (Setup 5), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.9. Simulations with same sparsity but different weights β (Setup 6), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.10. Simulations with unrelated tasks (Setup 7), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.
8.11. Simulations with a mixture of positively and negatively correlated tasks (roughly 50%/50% each, Setup 8), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.


8.12. The true non-zero simulation weights β, and the weights estimated by each method in one replication of the reference setup. The intensity of the lines represents the absolute value of the estimated weight β̂. The vertical and horizontal axes correspond to variables j = 1, ..., p and tasks k = 1, ..., K, respectively. Note that the weights of the lasso model were all zero.
8.13. Pearson correlations for 35 metabolites from cluster 1 of the DILGOM metabolites.
8.14. Recovered weight matrices B̂ for 200 genes over the 35 metabolites for lasso and fmpr-w2, based on penalties optimised by cross-validation. The vertical and horizontal axes represent genes and tasks, respectively. The intensity of each point represents the absolute value of the weight β.
8.15. Box-and-whisker plots of R² for fmpr-w2 and lasso over 35 metabolites from cluster 1 of the DILGOM metabolites, using gene expression as inputs. We used 10×5 nested cross-validation to produce 50 estimates. The stars represent statistical significance from a Bonferroni-corrected Wilcoxon rank-sum test, p ≤ 0.05/35 = 0.00143.
8.16. Average time to run fmpr over 50 independent replications. (a) p = 400, K = 10. (b) N = 100, K = 10. (c) N = 100, p = 100. The left panel in each subplot shows the wall time; the right panel shows time scaled to the same approximate range in order to better show the trends.
A.1. Internal validation (mean and 95% CI for AUC) for centroid classifier with RFE.
A.2. Internal validation (mean and 95% CI for AUC) for SVM classifier, using all features.
A.3. Internal validation (mean and 95% CI for AUC) for PAM classifier.
A.4. Internal validation (mean and 95% CI for AUC) for VV1 classifier.
A.5. Internal validation (mean and 95% CI for AUC) for VV2 classifier.
A.6. External validation (mean and 95% CI for AUC) for all models.
A.7. Kolmogorov-Smirnov plots for overlap between the gene sets and the modules of Desmedt et al. (2008).
A.8. Heatmap of selected genes in the combined dataset (932 samples), showing the three subclasses ER−/HER2−, ER+/HER2−, and HER2+.


B.1. APRC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data.
B.2. AUC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data.
B.3. The first 5 principal components (PCs) of (a) original Celiac1 data and (b) after removing high LD regions, thinning, and regression of previous SNPs. The strong structure in the top PCs is largely removed by accounting for LD. PCs 6–10 were only weakly predictive of the phenotype and are not shown for clarity.
B.4. PCA loadings per chromosome for each of the top 10 PCs (a) original Celiac1 data (b) pruned Celiac1 data. Note the different scales on the y-axis.
B.5. 10-fold cross-validated AUC for prediction of case/control status from the top 10 principal components of the Celiac1 dataset, using lasso logistic regression with glmnet (Friedman et al., 2010), selecting increasing numbers of principal components (right to left) (a) original dataset and (b) after LD-pruning.
B.6. LOESS-smoothed (with 95% pointwise confidence intervals about the mean) AUC for lasso models of stringently-filtered (a) Celiac1 and Celiac2-UK and (b) WTCCC-T1D, both in 30 × 3-fold cross-validation.
B.7. LOESS-smoothed AUC for models in 20×3-fold cross-validation.
B.8. Averaged PPV/NPV for models in 20×3-fold cross-validation.
B.9. Summary plots of one fold of cross-validation prediction in the WTCCC-T1D data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.
B.10. Summary plots of one fold of cross-validation prediction in the Celiac1 data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.


B.11. Summary plots of one fold of cross-validation prediction in the Celiac1 data after stringent filtering. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.
B.12. Summary plots of one fold of cross-validation prediction in the Celiac2-UK data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.
B.13. Summary plots of one fold of cross-validation prediction in the Celiac2-UK data after stringent filtering. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1.
B.14. LOESS-smoothed proportion of explained phenotypic variance, over 20×3-fold cross-validation.
B.15. LOESS-smoothed AUC for lasso squared-hinge loss classifier and logistic regression for random subsamples of the T1D data. For each prespecified size N ∈ {50, 100, 200, 400, 800, 1600, 3200}, we randomly sampled the original 4901 samples (without replacement) to form a smaller dataset. The subsampling was repeated 30 times for N = 50, 20 times for N = 100, and 10 times for the rest. Within each subsampled dataset, we ran 10 × 3CV to evaluate the AUC (for example, 30 × 10 × 3CV for N = 50). For N = 4901, we used the original dataset without sampling, running 20 × 3CV.
B.16. Principal Component Analysis (PCA) of the cases only, using the top 100 SNPs identified by the lasso for the Celiac1 and Celiac2-UK datasets, and their stringently-filtered versions. Samples are colored by median specificity in the cross-validation replications: median specificity ≥ 0.99 (red), and the rest (black).
B.17. Principal Component Analysis (PCA) of the cases only, using the top 100 SNPs identified by the lasso for T1D. Samples are colored by median specificity in the cross-validation replications: median specificity ≥ 0.99 (red), and the rest (black).
B.18. AUC, PPV/NPV, and explained phenotypic variance for Bipolar Disease.
B.19. AUC, PPV/NPV, and explained phenotypic variance for Coronary Artery Disease.
B.20. AUC, PPV/NPV, and explained phenotypic variance for Celiac1.
B.21. AUC, PPV/NPV, and explained phenotypic variance for Celiac2-UK.
B.22. AUC, PPV/NPV, and explained phenotypic variance for Crohn's.
B.23. AUC, PPV/NPV, and explained phenotypic variance for Hypertension.
B.24. AUC, PPV/NPV, and explained phenotypic variance for Rheumatoid Arthritis.
B.25. AUC, PPV/NPV, and explained phenotypic variance for Type 1 Diabetes.
B.26. AUC, PPV/NPV, and explained phenotypic variance for Type 2 Diabetes.


C.1. Time to run fmpr over 50 independent replications. (a) p = 400, K = 10. (b) N = 100, K = 10. (c) N = 100, p = 100.


List of Tables

4.1. Clinical and demographic characteristics of the patients in the five breast cancer datasets. Samples were removed if they were censored before the 5-year cutoff or were treated with adjuvant therapy. The clinical summaries are for the cleaned version of the data (post-removal). Grade: histologic grade. Therapy: neoadjuvant or adjuvant therapy. 1Q: first quartile; Med: median; 3Q: third quartile. ER status: estrogen receptor status.
4.2. The gene set statistics used in this work.
4.3. Top 10 gene sets by average rank over the five datasets, using the set centroid statistic. GO enrichment p-values are from a Bonferroni-adjusted one-sided Fisher's exact test (30,330 tests). Sign=−1 if expression is negatively associated with long-term survival, and is +1 otherwise. The background list for the test includes all Affymetrix HG-U133A probesets that could be mapped to GO BP terms, excluding IEA annotations.
4.4. Breakdown of samples for each cancer subtype.
4.5. Top 10 MSigDB sets for ER/HER2 molecular subtypes, chosen by the centroid classifier using the set centroid statistic. Sign=−1 if expression is negatively associated with long-term survival, and +1 for positive association with long-term survival.
4.6. Top 10 sets using the set centroid statistic using different classifiers, and the p-value for the size of the intersection between the top individual genes and the top gene sets (Fisher's exact test, one-sided). CC is centroid classifier, LR is logistic regression.
6.1. List of discovery datasets used in this analysis. The 1958 British Birth Cohort (N = 1480) and the National Blood Service (N = 1458) datasets were used as shared controls for all WTCCC datasets. † Celiac1 used Illumina HumanHap33v1-1 for cases and HumanHap550-2v3 for controls, and Celiac2-UK used Illumina 670-QuadCustom-v1 for cases and Illumina 1.2M-DuoCustom-v1 for controls.


6.2. List of independent replication datasets used. The National Blood Service† (N = 1458) dataset was used as controls for the GoKinD-T1D dataset. Celiac2-IT and Celiac2-NL used Illumina 670QuadCustom-v1 for cases and controls, Celiac2-Finn used 670-QuadCustom-v1 for cases and 610-Quad for controls.
6.3. AUC and explained phenotypic variance for independent validation datasets of celiac disease models trained on Celiac1. We used models with ∼200 SNPs in the model, trained in cross-validation on Celiac1 and tested on subsets of the Celiac2 dataset. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 1%.
6.4. AUC and explained phenotypic variance for independent validation datasets of celiac disease models trained on Celiac2-UK. Models were trained in cross-validation on the UK subset of the Celiac2 datasets, and tested on the other three subsets of the Celiac2 dataset. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 1%.
6.5. Models were trained in cross-validation on the WTCCC-T1D dataset and tested on the GoKinD-T1D dataset, using ∼100 SNPs in the model. The 95% confidence interval is derived from the LOESS fit. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 0.54%.
7.1. The marginal and conditional independence statements that can be derived from the (SNP, Gene, Metabolite) graph, and the corresponding correlation and partial correlations.
7.2. The stable predictive genes selected for each metabolic cluster (appeared in the lasso model in ≥ 60% of the cross-validation replications). "-" indicates that no genes were stably selected in this cluster.
7.3. trans-QTLs for genes associated with the metabolites predictive of fasting glucose levels.
7.4. Genomic inflation factors for genes associated with metabolites predictive of fasting glucose, based on the median χ² statistics from the linear model of association in PLINK.
A.1. […] of external-validation AUC for different numbers of features. The AUC for individual genes is used as the intercept.


B.1. The confusion matrix of predicted versus actual classes. "True" is truly causal SNPs, "False" is non-causal SNPs, Ŷ = 1 and Ŷ = 0 are predictions of causal and non-causal SNPs, respectively.
B.2. Genomic inflation factors λ estimated by PLINK v1.07 using the median of statistics for either the 1-df χ² test (--assoc --adjust) or the logistic regression test (without covariates, --logistic --adjust).
B.3. Population prevalence for each disease as used in this work.
B.4. AUC and proportion of phenotypic variance explained for GCTA (Yang et al., 2011), using 3-fold cross-validation (CV). AUC was derived from the per-sample scores in the test folds for each cross-validation fold. The 95% confidence interval is from a one-sample t-test, and explained variance (including the confidence intervals) is estimated from the AUC and prevalence K using the method of Wray et al. (2010). The column denoted N is the number of AUC values estimated in cross-validation — each 3CV produces N = 3 AUC values.
B.5. AUC estimated in 10 × 10-fold cross-validation on chr6 in the Celiac1 dataset (2200 samples, 19,169 SNPs).
B.6. BD dataset, autosomes only. Prevalence from Bebbington and Ramana (1995); Wray et al. (2010).
B.7. CAD dataset, autosomes only. Prevalence from Wray et al. (2010).
B.8. Celiac datasets, autosomes only. Prevalence from van Heel and West (2006).
B.9. Crohn's dataset, autosomes only. Prevalence from Carter et al. (2004); Wray et al. (2010).
B.10. HT dataset, autosomes only. Prevalence from NHS (2010).
B.11. RA dataset, autosomes only. Prevalence from Wray et al. (2010).
B.12. T1D dataset, autosomes only. Prevalence from Wray et al. (2010).
B.13. T2D dataset, autosomes only. Prevalence from Wray et al. (2010).


List of Abbreviations

AUC – area under receiver operating characteristic curve
FN – false negative
FP – false positive
GEO – Gene Expression Omnibus
GiB – gibibyte, 2³⁰ bytes
GO – Gene Ontology
GWAS – genome-wide association study
LD – linkage disequilibrium
Mb – megabase
MiB – mebibyte, 2²⁰ bytes
mmol/L – millimoles per litre
MSE – mean squared error
NPV – negative predictive value
PCA – principal component analysis
PPV – positive predictive value
PRC – precision-recall
ROC – receiver operating characteristic
SNP – single nucleotide polymorphism
SVM – support vector machine
TiB – tebibyte, 2⁴⁰ bytes
TN – true negative
TP – true positive
WTCCC – Wellcome Trust Case-Control Consortium
eQTL – expression quantitative trait locus


1. Introduction

The development of high throughput technologies for assaying the molecular characteristics of tissues and cells has been transforming the biological sciences for the past decade or so, making them increasingly quantitative. Modern technologies such as gene expression microarrays, single nucleotide polymorphism (SNP) microarrays, epigenetic marker arrays, whole genome sequencing, and high throughput metabolomics all generate a wealth of data. These datasets have been immensely useful for characterising genetic, transcriptomic (gene expression), and metabolomic variation across individuals and populations, and for relating this variation to observed phenotypes such as disease. For example, gene expression datasets allow us to assess how predictive gene expression is of phenotypes such as breast cancer metastasis (van 't Veer et al., 2002), and which genes and pathways are responsible for these cellular processes. With SNP datasets, we can assess which SNPs are strongly associated with disease, which genes are likely affected by these SNPs, and more generally evaluate the genetic architecture of each phenotype (Manolio et al., 2009) and the strength of the genetic component in the overall observed variation of the phenotype.

Once the non-trivial technical challenge of obtaining the data has been overcome, the next challenge is extracting useful and relevant biological information from the data. Due to the large size and complexity of such datasets, manually detecting and interpreting patterns in them is beyond the ability of any human, and data analysis is increasingly relying on computational and statistical methods to extract meaningful insight, either for diagnostic purposes or for generating hypotheses about the underlying biology that can later be verified in the lab. This thesis deals with the computational and statistical aspects of modelling large


molecular marker data across several domains, including gene expression data, SNP data, and metabolite data, focusing on efficient and effective methods for analysing the data while maintaining biological interpretability.

More broadly, the topics of this thesis are related to the goal of "personalised medicine", which aims to diagnose and treat patients based on their own genomic information, at a level far more specific and detailed than has previously been possible by relying solely on traditional clinical variables such as age, sex, and disease symptoms. One such example is the use of genomic profiles to identify cancer subtypes and consequently to prescribe different drugs to cancer patients based on their subtype (Chin et al., 2011; Schilsky, 2010). Personalised medicine presents technical challenges for current methods in computer science and machine learning (Fernald et al., 2011): First, the large size of many of the datasets requires careful design of algorithms for processing and storing the data. Second, interpreting any patterns or associations in terms of function and effect on the phenotype is challenging. Third, there is the challenge of integrating diverse data types, each capturing a unique aspect of the data, into a coherent model of the underlying biological processes. Finally, there is the challenge of translating mathematical models into clinically relevant and actionable insights.

Main Themes of This Thesis

There are several key challenges in the computational and statistical analysis of molecular biology data, which motivate the ideas developed in this thesis.

Prediction We employ statistical models of molecular marker data trained on data where the phenotype is known (supervised learning). The main criterion we use to evaluate our statistical models is their predictive ability with respect to the phenotype being modelled: how well does a model predict the phenotype given some input such as gene expression or SNPs? Predictive ability quantifies the degree of association between the inputs (genotypes or gene expression) and the outputs (phenotypes). Competing models are compared by assessing their predictive ability. High predictive ability can also be useful in a practical sense, for example, for clinical diagnosis of which patients are at higher risk of breast cancer metastasis and relapse (discussed in Chapter 4). Predictive ability can be measured using different statistics: for classification we might employ accuracy, the area under the receiver operating characteristic curve, and precision-recall curves, whereas for regression we might use the mean squared error or R^2. Therefore, the choice of the appropriate predictive measure is an important part of the modelling process.

Interpretability Beyond predictive ability, having interpretable models is an important consideration, as one of the main goals of biology is to generate plausible mechanistic explanations of the cellular processes underlying health and disease. A model that has very high predictive


ability but is difficult to interpret may be less useful than a less predictive model that is amenable to biological interpretation and provides insights into disease etiology. Hence, interpretation is another theme of this thesis. All the models we employ in this thesis are linear models (or transformations of linear models, such as logistic regression), where the weights given to each input (marker) are directly interpretable as their contributions to the overall model. In particular, we use lasso models (Chapters 5, 6, and 7), which are sparse models, leading to the selection of a relatively small number of markers that enter the model with a non-zero weight. This is in contrast with other approaches, for example kernel methods, which might achieve good prediction but where interpretation in terms of the contributions of individual inputs is difficult.

Scalability A good modelling approach is of no practical use if it cannot be applied to real data. The modelling approaches should be computationally tractable and scalable, as datasets are rapidly increasing in size both in terms of samples and in terms of markers. For example, genome-wide association studies now routinely include tens of thousands of samples assayed over more than half a million single nucleotide polymorphism (SNP) markers, and array sizes of around two million markers are now becoming available. The algorithms we use to fit the models in this thesis are efficient and scalable to large datasets, allowing us to successfully model current SNP datasets involving thousands of samples and hundreds of thousands of markers, where other approaches may fall short or require much greater computational resources.

Data Integration Multiple types of data can be assayed over the same samples. For example, The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov) contains SNP, gene expression, copy number, methylation, micro-RNA, and other datasets assayed over the same samples, for several cancer types. While each dataset provides valuable information about the underlying biological mechanisms of disease, these are disparate views of the same cellular processes. Potentially greater insight is produced by integrating the data into a coherent biological model, where health and disease are considered to be outcomes of complex interactions between genetic variation, epigenetics, and environment. Chapter 7 is a case study, where we employ sparse linear models to perform an integrated analysis of the DILGOM multi-omic dataset (Inouye et al., 2010b), consisting of SNPs, gene expression, metabolites, and clinical variables, inferring causal networks of SNPs and genes affecting metabolite levels and associated with fasting glucose, a clinical marker for type 2 diabetes. The DILGOM dataset is currently one of the largest datasets of this type.
Another form of data integration is analysis of datasets where multiple correlated phenotypes are assayed over the same samples. Algorithms for analysis of correlated phenotypes are discussed in Chapter 8.
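To make the prediction and interpretability themes concrete, the sketch below fits an L1-penalised (lasso) logistic regression to simulated genotype-like data and evaluates it by AUC. It is a minimal illustration only, not the methods or software developed in later chapters; the simulated data, the scikit-learn calls, and the penalty setting (C = 0.1) are assumptions made purely for the example.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Simulated data: 1000 samples x 500 SNPs coded 0/1/2, with 10 truly associated SNPs.
    n, p = 1000, 500
    X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
    effects = np.zeros(p)
    effects[:10] = 0.5
    risk = X @ effects
    y = (risk + rng.normal(size=n) > np.median(risk)).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # The L1 penalty drives most coefficients to exactly zero (a sparse model).
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    model.fit(X_train, y_train)

    auc = roc_auc_score(y_test, model.decision_function(X_test))
    n_selected = int(np.sum(model.coef_ != 0))
    print(f"test AUC = {auc:.2f}; SNPs with non-zero weight = {n_selected}")

The markers with non-zero weights are the "selected" inputs; reading the model reduces to inspecting this short list, which is the sense in which sparse linear models remain interpretable while still being judged on predictive ability.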


Thesis Outline and Contributions

Chapter 2 — Biological Background is a review chapter, introducing the basic biological concepts used in this thesis, beginning with topics such as the central dogma of molecular biology, how gene expression is measured using microarrays, and challenges in the analysis and interpretation of gene expression experiments. We then discuss genetic data, particularly single nucleotide polymorphism (SNP) data. We highlight several competing hypotheses for the genetic basis of disease, and discuss genome-wide association studies (GWAS), which are unbiased scans of SNP data for association with phenotypes. Finally, we discuss the problem of missing heritability, and expression quantitative trait loci (eQTL), where gene expression itself is used as a phenotype in a GWAS.

Chapter 3 — Review of the Analysis of Gene Expression and Genetic Data discusses some of the main concepts in statistics and supervised machine learning that form the basis for the rest of the work in this thesis. We emphasise the roles of feature selection, and especially of sparse penalised methods such as the lasso, which are used heavily throughout the thesis for modelling data.

Chapter 4 — Prediction of Breast Cancer Prognosis using Gene Set Statistics discusses the problem of predicting future breast cancer metastasis and relapse based on gene expression data. This problem has been the focus of intense research, as successful prediction of which women are at higher risk would have important clinical implications for many thousands of women around the world, allowing more personalised treatment of the most common cancer in Western women. While gene expression has been found to be moderately predictive of metastasis risk, the prognostic genes found by different studies have so far been largely inconsistent, raising doubts about the interpretation of these results and the underlying biological mechanisms. We propose an approach based on gene sets, rather than individual genes, where membership of genes in a set is based on prior knowledge, such as the literature, large scale experiments, and curated pathways. The gene expression levels in each set are aggregated using a set statistic. The set expression levels are then used as inputs for standard classification models. We evaluate five breast cancer datasets, comparing the set approach with the standard approach based on individual genes. Our contributions include:

• We propose a gene set statistic framework, which produces gene signatures based on sets of genes rather than individual genes.

• We apply our method to five independent breast cancer datasets, evaluating multiple variants of the gene set method.

• We demonstrate that the gene set approach produces more robust and consistent prognostic signatures than those based on individual genes.


• We show that the top predictive sets are highly biologically interpretable, consisting of genes belonging to known pathways associated with the cell cycle and metastasis processes.

Chapter 5 — Fast and Memory-Efficient Sparse Linear Models deals with the problem of fitting sparse (lasso) linear statistical models to large genetic datasets. Such models can potentially be useful for diagnostic purposes by predicting disease from genotype, for identifying the SNPs associated with the phenotype, and for estimating how much of the variability in the phenotype is due to genetic factors. However, existing general-purpose approaches are not well suited to this problem, as the datasets are large and fitting the models requires large amounts of RAM. We propose an efficient algorithm, named SparSNP, which operates in an out-of-core fashion, allowing it to rapidly fit lasso classification and regression models to large SNP data while using low amounts of memory (a simplified sketch of this block-wise access pattern is given after the Chapter 6 contributions below). We compare our approach with several other state of the art methods for fitting lasso models, using a case/control celiac disease dataset. Our contributions include:

• We develop an out-of-core algorithm for efficiently fitting lasso models to large SNP data, either for classification or for regression.

• We evaluate our implementation, SparSNP, against several state of the art methods on real genetic data, demonstrating that our method is faster than existing approaches and more scalable in terms of memory requirements.

Chapter 6 — Sparse Linear Models Explain Phenotypic Variation and Predict Risk of Complex Disease applies the lasso models developed in Chapter 5 to the problem of case/control prediction in eight datasets of human complex disease, such as type 1 and 2 diabetes and celiac disease. Our contributions include:

• We compare lasso models with univariable methods for the task of detecting associated genetic variants, and analyse the strengths and weaknesses of each approach.

• We perform an analysis using lasso models of eight complex human diseases, including celiac disease, type 1 and 2 diabetes, Crohn's disease, bipolar disorder, hypertension, rheumatoid arthritis, and coronary artery disease. For each disease, we characterise how well it can be predicted from the genetic data and how much phenotypic variance can be explained by the data.

• We propose that celiac disease and type 1 diabetes may have genetic subtypes that exhibit more genetic predictability than others, and assess the importance of these findings in the population-wide setting.
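The out-of-core strategy behind Chapter 5 can be illustrated schematically: instead of loading the full genotype matrix into RAM, the data stay on disk and are visited one block of SNPs at a time, so memory use is bounded by the block size. The sketch below is a simplified, hypothetical illustration of that access pattern only; it is not SparSNP itself, and the file layout, dimensions, and per-block computation are assumptions made for the example.

    import numpy as np

    n_samples, n_snps, block_size = 2000, 50000, 5000

    # Create a toy on-disk genotype matrix (0/1/2, one byte per entry) purely for
    # illustration; in a real study such a file would already exist.
    rng = np.random.default_rng(0)
    genotypes = np.memmap("genotypes.dat", dtype=np.int8, mode="w+",
                          shape=(n_samples, n_snps))
    genotypes[:] = rng.binomial(2, 0.3, size=(n_samples, n_snps)).astype(np.int8)
    genotypes.flush()

    # Out-of-core pass: only one block of SNPs is ever expanded to floats in RAM.
    allele_freq = np.empty(n_snps)
    for start in range(0, n_snps, block_size):
        stop = min(start + block_size, n_snps)
        block = np.asarray(genotypes[:, start:stop], dtype=float)
        allele_freq[start:stop] = block.mean(axis=0) / 2.0

    print(allele_freq[:5].round(3))

The same block-wise pattern applies when the per-block computation is a model update rather than a simple summary statistic; the point is only that memory usage depends on the block size, not on the total number of SNPs.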


Chapter 7 — Characterising the Genetic Control of Human Metabolic Genes leverages the lasso models developed earlier in the setting of prediction of quantitative traits. We perform an integrated analysis of a multi-omic dataset (DILGOM) consisting of SNPs, gene expression, clinical variables, and serum metabolites, with the aim of deriving insights into the genetic control of metabolites mediated by genes. Our contributions include:

• We identify genes highly associated with metabolite levels, characterising the degree to which each metabolite can be predicted. Many of these genes are known to be associated with metabolism; however, we also identify genes with strong associations but previously unknown function.

• We identify and characterise SNPs that are likely to regulate some of these predictive genes, both in cis and in trans.

• We associate fasting glucose levels, a clinical marker for type 2 diabetes, with metabolites, and infer causal networks of genetic regulation for these metabolites, mediated by gene expression. These networks represent novel hypotheses that may explain some of the genetic basis of type 2 diabetes.

Chapter 8 — Fused Multitask Penalised Regression proposes a novel statistical framework for regression and classification in the multi-task (multiple phenotype) setting, termed Fused Multitask Penalised Regression (FMPR). Examples of such settings include prediction of multiple metabolite levels from gene expression and prediction of the expression of multiple genes from genetic variants. Our method leverages the correlations between outputs to produce sparse models, assuming that correlated outputs are due to shared inputs. In contrast, most existing methods, such as the lasso, ignore such relatedness and treat each phenotype separately (a small illustrative contrast between single-task and multi-task fits is sketched at the end of this outline). Our contributions include:

• A novel multitask model for modelling correlated outputs.

• An algorithm and efficient implementation of our multi-task method.

• A comparison of our method with existing single-task and multi-task methods in simulation, demonstrating the usefulness of our approach both in prediction accuracy and in recovering the true causal inputs.

• An evaluation of our method on real data involving gene expression and metabolites, demonstrating that our approach results in better predictive models than the lasso.

Chapter 9 — Conclusions concludes this thesis and discusses ways in which this work can be extended in the future.
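The contrast between single-task and multi-task fitting referred to above can be sketched as follows. The example uses scikit-learn's generic Lasso (one model per output) and MultiTaskLasso (joint row-wise selection across outputs) purely as stand-ins to illustrate the setting; it is not the FMPR method proposed in Chapter 8, and the data and penalty values are assumptions made for the illustration.

    import numpy as np
    from sklearn.linear_model import Lasso, MultiTaskLasso

    rng = np.random.default_rng(1)

    # 200 samples, 100 inputs (e.g. expression probes), 5 correlated outputs
    # (e.g. metabolite levels) all driven by the same 5 inputs.
    n, p, k = 200, 100, 5
    X = rng.normal(size=(n, p))
    B = np.zeros((p, k))
    B[:5, :] = rng.normal(size=(5, k))
    Y = X @ B + 0.5 * rng.normal(size=(n, k))

    # Single-task baseline: an independent lasso per output, ignoring shared structure.
    single = np.column_stack([Lasso(alpha=0.1).fit(X, Y[:, j]).coef_ for j in range(k)])

    # Multi-task penalty: inputs are selected jointly (a row is zero for all outputs or for none).
    multi = MultiTaskLasso(alpha=0.1).fit(X, Y).coef_.T

    print("inputs selected per output (single-task):", (single != 0).sum(axis=0))
    print("inputs selected jointly (multi-task):", int((np.abs(multi).sum(axis=1) > 0).sum()))

When the outputs really do share their causal inputs, the jointly penalised fit tends to recover a single compact set of inputs, which is the intuition that multi-task approaches build on.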


Summary

In summary, this thesis is concerned with effective and efficient supervised learning methods for modelling molecular marker data, such as gene expression levels, metabolite levels, and SNPs. This work shows the feasibility of sparse models for analysis of large datasets, and the utility of these models for modelling human gene expression and SNP data, both for prediction of the phenotype and for biological interpretation of the underlying cellular mechanisms. We also demonstrate the increased biological insight gained from an integrated analysis of multi-omic datasets, combining multiple sources of data assayed on the same samples. Overall, this work advances the possibility of developing better predictive models of human disease, bringing the potential benefits of personalised medicine closer to realisation.


2. Biological Background

In this chapter we survey some of the basic biological concepts and terminology used in this thesis. Our discussion mainly uses human disease as the phenotype of interest, but many of the underlying mechanisms hold more generally across other phenotypes and other organisms. In addition, we limit our discussion to the genomic components of disease, and do not examine environmental factors or interactions of genes with the environment.

2.1. The Central Dogma of Molecular Biology

The central dogma of molecular biology is that information in the cell flows in one direction, starting with DNA, as illustrated in Figure 2.1. DNA (deoxyribonucleic acid) is composed of four bases: A (adenine), G (guanine), C (cytosine), and T (thymine). DNA is divided into two types of regions — genic and intergenic — where the former contain genes and the latter do not. The genic regions are composed of expressed regions (exons) and intervening regions (introns). The first step in the information flow, therefore, is when the gene is transcribed into precursor-mRNA, which is then spliced (removing the introns) to form messenger RNA (ribonucleic acid). Second, the ribosomes translate the transcribed mRNA sequence to protein sequence by chaining together amino acids, one for each codon (base triplet) in the mRNA. Third, the completed protein is released and is then free to perform some intra-cellular or extra-cellular action, for example as an enzyme or as a structural protein in the cell. Note that the entire process of expression begins with "step zero", when a protein called a transcription factor (TF) binds to the promoter region of a gene, initiating a complex chain of events in which the gene is transcribed and spliced


and mRNA is produced. The system is self regulating — DNA codes for transcription factors that bind to DNA and modulate the expression of other genes and their protein products, including other transcription factors, thus creating a closed loop, an essential component of self regulation (Alon, 2007).

Figure 2.1.: The Central Dogma of molecular biology (DNA → mRNA → protein → phenotype).

A mutation in the DNA, due, for example, to imperfect replication or to ionising radiation, may change the amino acid sequence of the protein produced, or may lead to under- or over-production of certain proteins. These molecular-level conditions may manifest through what we perceive as disease. A well-known example is the mutation in the gene HBB in humans, causing misfolding of the protein beta-globin, which in turn manifests as the disease known as sickle cell anaemia (http://www.ncbi.nlm.nih.gov/omim/603903).

Although this four-step model of cellular information flow is known to be a crude oversimplification of a complex reality, it is a useful mental model nonetheless, as long as we are mindful of its limitations. We expand this basic model throughout this chapter, incorporating other known mechanisms of information flow.

2.2. The Molecular Basis for Disease

To date, there have been two major efforts in the search for cellular phenomena associated with disease. The first, which we may term "transcriptomic", has been in the area of gene expression. Under the central dogma, over- or under-expressed genes are an important step in a complex molecular cascade eventually leading to what we perceive as disease. Therefore, in the transcriptomic approach, we search for associations of genes with some observed phenotype


such as the disease itself or some other clinical measurement (for example, tumour size or survival time), with the aims of implicating genes that might be responsible for the condition and potentially developing diagnostic and prognostic tools. Since gene expression manifests itself through mRNA in the cell, what we actually measure is mRNA levels, using gene expression microarrays.

The second effort, which we term "genetic", has been to characterise which DNA loci harbour variations that are strongly associated with the phenotype. Out of the possible DNA variations, we concentrate on single-nucleotide polymorphisms (SNPs). Other variations include copy number variations and chromosomal inversions.

SNPs are single-base DNA loci that have several variants, usually two (biallelic). SNPs can occur anywhere in the DNA: in intergenic regions, in exons, and in introns. The variation may or may not have an effect in the cell, depending on the functional importance of where it occurred and the nature of the variation itself — exonic variations that do not change the resulting protein are synonymous, whereas those that alter the protein are non-synonymous.

There are significant differences between the transcriptomic and genetic approaches. In contrast with a SNP, which is a physical variation in one DNA nucleotide, a gene is a conceptual annotation of a region of DNA, typically thousands of nucleotides long, that usually encodes a protein. Each gene has its own regulatory region, the promoter, where transcription factors can bind and regulate the gene's expression. Therefore, the difference between analysing gene expression and analysing SNPs is that with the former we are asking how active is the gene? whereas with the latter we are asking which variant of the DNA do we have? and, assuming that the SNP is known to affect the gene, which variant of the gene do we have? Another important difference between the transcriptomic and genetic approaches is that mRNA levels are merely snapshots of highly dynamic cellular activities and depend on factors such as which tissue was measured and when. Changes to mRNA levels occur on time-scales ranging from minutes to days or months (Alon, 2007), depending on which cellular mechanism they belong to. In contrast, (non-somatic) genetic mutations are believed to be largely immutable once they have occurred, are passed from one generation to the next unless they are highly lethal, and occur over time scales of multiple generations, which means decades for humans.
Currently, there are estimated to be about 30,000 human genes, whereas the HapMap (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2005, 2007) and 1000 Genomes (1000 Genomes Project Consortium, 2010) projects have to date mapped more than 30 million human SNPs, and the number is likely to rise as more diverse human populations are sequenced.

These differences between transcriptomic and genetic data, together with the fact that basic genetic theory dates back to Mendel in the 19th century, before the discovery of the structure of DNA and the understanding of basic genomic mechanisms in the 1950s, have meant that transcriptomic and genetic data have largely been analysed separately and using different methods, as


discussed in Chapter 3. Recently, more integrative approaches have been proposed, such as the analysis of expression quantitative trait loci (eQTL), that integrate transcriptomics and genetics in order to produce a clearer picture of cellular mechanisms; we discuss these in Section 2.4.8.

2.3. Gene Expression

Genes do not affect phenotype directly. Rather, their effects are mediated by mRNA and protein. Since DNA mutations are known to cause changes in observed phenotypes in organisms, it is reasonable to examine whether any genes exhibit changes in activity patterns that can be associated with phenotype. Gene activity, termed gene expression, is inferred through measuring the levels of mRNA in the cell. In the simplest conceptual model, genes are thought of as being "ON" (highly expressed) or "OFF" (low or basal expression). More sophisticated models may be more fine grained, considering genes on a continuous scale. Different genes show different basal levels of expression and different dynamic ranges. Therefore, gene expression levels are usually not compared directly, but are expressed as the difference in expression between two phenotypic states — differential expression. Genes with significant differential expression between two conditions (case/control) are candidates for the genes that carry the mutation or are otherwise potentially important in the underlying molecular mechanism leading to the phenotype. Statistical analysis of differential expression is discussed further in Chapter 3.

2.3.1. Measuring Gene Expression

Gene expression microarrays are the tools used to measure relative or absolute concentrations of mRNA in tissue, and thus to infer gene activity levels. The arrays themselves are made of thin pieces of glass or plastic, and on their surface there are thousands of spots with short DNA sequences. Early microarrays were usually of the two-channel spotted cDNA (complementary DNA) type, usually custom-made, whereas modern microarrays are typically commercial high density oligonucleotide arrays, such as those produced by Affymetrix (Santa Clara, CA, USA). Whereas spotted arrays tend to contain several thousand relatively long probes, modern microarrays contain tens of thousands of probesets; the human HG-U133plus2.0 by Affymetrix contains roughly 50,000 short probesets. A probeset is a set of 16–20 short DNA probes used to measure one transcript. Each probeset contains perfect match (PM) and mismatch (MM) probes, designed to enable calibration of expression, accounting for effects such as non-specific binding (see Section 2.3.2). With all arrays there is not necessarily a one-to-one correspondence between genes and probes, and post-processing is needed to map from probesets to genes.

As outlined in Figure 2.2, the basic stages of a microarray experiment are:


• mRNA extraction mRNA is retrieved from tissue samples, either from samples such as biopsies or blood, or from cell lines. Since gene expression levels differ between tissues, mRNA must be extracted from the relevant tissue for each experiment. Once mRNA has been extracted, a complementary copy of the mRNA is created from it (A→T, G→C, and vice versa). For spotted arrays, cRNA samples from two phenotypes (for example, tumour and normal) are treated slightly differently; samples from one phenotype are labelled with red fluorescent dye (Cy5), whereas those from the other phenotype are labelled with green (Cy3). cRNA for oligonucleotide arrays undergoes biotinylation (labelling with biotin).

• Hybridisation and washing The labelled cRNA is hybridised to the arrays, a process in which some of the cRNA binds to the probes on the array and the remaining unbound material is washed off. When the sample contains more cRNA, more of it binds to the probes, and vice versa. For oligonucleotide arrays, this stage includes staining with streptavidin-phycoerythrin, which binds to the biotin. For spotted arrays, both matched samples, red and green, are hybridised to the same array.

• Measurement The arrays are scanned using a laser scanner that measures the amount of fluorescence at each spot on the array. The fluorescence is proportional to the number of cRNA molecules bound to the spot. For Affymetrix arrays, the image is stored as a CEL file, which contains image intensities and various experimental parameters. In general, oligonucleotide arrays measure absolute hybridisation intensities, and at least two arrays, one for each sample, are needed for the purpose of estimating differences in gene expression (differential expression). In contrast, spotted cDNA arrays measure relative hybridisation, since at each spot both the Cy3- and Cy5-labelled cRNA binds, albeit at potentially different levels. The ratio of hybridisation intensities is then used to estimate differential expression.

• Preprocessing After raw intensities have been determined, a crucial step in the analysis of gene expression microarrays is preprocessing of the data. Preprocessing involves quality control, in which arrays with many faulty probes are discarded, and normalisation, which is statistical correction for potentially confounding but biologically uninteresting artefacts due to measurement variation within and between the arrays (Smyth and Speed, 2003) (see Section 2.3.2 for further discussion). A common preprocessing method for oligonucleotide arrays is the Robust Multichip Average (RMA) (Bolstad et al., 2003; Irizarry et al., 2003a,b).

• Analysis After preprocessing, gene expression data is converted to log2 intensities, since this reduces the dynamic range of the data and makes the noise (variability) approximately normally distributed. Now the data can be used for tasks such as searching


for genes differentially expressed between two conditions, clustering of genes and samples into groups to find novel disease subtypes, and classification of the phenotype based on the gene expression. See Chapter 3 for more discussion of methods for analysing gene expression data.

Figure 2.2.: An outline of the gene expression microarray experiment, for spotted cDNA (left) and oligonucleotide arrays (right). Reprinted by permission from Macmillan Publishers Ltd (Staal et al., 2003), copyright (2003).
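As a small concrete illustration of the last two stages, the sketch below log2-transforms a simulated intensity matrix and ranks probes by a per-probe two-sample t-test between cases and controls. The data, group sizes, and cut-offs are invented for the example and do not describe any particular dataset analysed in this thesis.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(2)

    # Simulated raw intensities: 1000 probes x 20 arrays (10 cases, 10 controls),
    # with the first 50 probes up-regulated in the cases.
    intensities = rng.lognormal(mean=6.0, sigma=1.0, size=(1000, 20))
    intensities[:50, :10] *= 2.0

    log_expr = np.log2(intensities)          # log2 reduces the dynamic range
    cases, controls = log_expr[:, :10], log_expr[:, 10:]

    # Per-probe differential expression: two-sample t-test, case versus control.
    t_stat, p_val = ttest_ind(cases, controls, axis=1)
    top = np.argsort(p_val)[:10]
    log_fc = cases.mean(axis=1) - controls.mean(axis=1)
    print("top probes:", top)
    print("their log2 fold changes:", log_fc[top].round(2))

In a real analysis the p-values would also need correction for multiple testing across thousands of probes; the snippet only shows where the log2 transformation and the case/control comparison sit in the workflow.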
2.3.2. Challenges in Analysis of Gene Expression Microarrays

There are several sources of error, variation, and confounding in gene expression experiments. These sources can largely be classified into two groups, extrinsic and intrinsic. Extrinsic factors are ones that are not biologically insightful and potentially weaken or confound statistical


analyses, and should be eliminated as far as possible prior to analysis. Some extrinsic factors affecting gene expression experiments are:

• Hybridisation noise and batch effects Hybridisation of cRNA to the arrays is a chemical process and as such it is stochastic and dependent on external factors such as temperature, sample age, and experimental conditions such as the concentrations of the reagents used in the process. The stochasticity implies that the measured amount of hybridisation varies between experiments, even when measured under very similar conditions. In addition, if different samples in a study were analysed under different conditions or in different labs, then there can be batch effects in the data, that is, systematic differences in expression levels attributable to external factors rather than to intrinsic differences in the expressed genes. Depending on their magnitude and direction, batch effects can mask gene effects or create spurious gene effects in the data (for example, in the extreme case where all cases are analysed in one lab and all controls in another). Hybridisation noise is typically mitigated by strict laboratory protocols and by averaging the data over a large number of samples. Batch effects can be prevented through careful experimental design, such as randomising the split of samples between labs, or by strict analysis of the samples in the same lab under similar conditions. If batch effects are already present in the data, statistical correction can be used to some degree; however, it is more difficult than preventing the problem in the first place, since the analysis requires making assumptions about the sources of variability and their relative importance, and these sources are not always well known. A less common solution is to use technical replicates — repeated arrays of the same tissue sample, sometimes analysed by different labs. Technical replicates can be averaged to form a more stable estimate of the hybridisation for that sample.

• Measurement noise The hybridised cRNA is stained or labelled with fluorescent dye, which is detected by the laser and converted to a digital image. This process is not perfect, and errors can be introduced by the measuring device. Again, this problem is usually mitigated through quality control and the use of multiple samples.

• Differential binding Different cRNA fragments have different binding affinities, as determined by the thermodynamics of the binding process, which are, in turn, dictated by their chemical structure and environmental factors such as temperature. Therefore, at equal concentrations, certain fragments are more likely to bind to the probes than others. This may result in a biased estimate of gene expression, since the weakly-binding fragments will be measured as missing or having low expression relative to the strongly-binding fragments.

• Non-specific binding Each probe on an oligonucleotide microarray is a short fragment of cDNA, 25 nucleotides long. A set of 16–20 probes forms a probeset. The probeset


is intended to measure binding for one gene. However, some cRNA fragments can bind to more than one probe, even across probesets, potentially being measured as multiple genes. This non-uniqueness in binding, also called cross-hybridisation, is a confounding factor in measuring gene expression. Oligonucleotide arrays such as those by Affymetrix try to mitigate this problem by employing two types of probes, perfect match (PM) probes and mismatch (MM) probes. The PM probe should be bound by the true gene's cRNA, whereas the MM probe differs in its central nucleotide from the PM probe and is intended to be bound by non-specific fragments. Hence, the simplest approach to account for non-specific binding is to normalise each PM probe by its matching MM probe. However, more sophisticated approaches have been developed, such as quantile normalisation (Bolstad et al., 2003), and MM probes are now ignored or have been completely removed from recent arrays.

• Faulty arrays Some arrays in an experiment may be faulty, for example, when many of the probes did not hybridise well. Quality control, and sometimes visual inspection of the results for each array, is necessary to make sure these arrays are discarded prior to analysis.

• Experimental errors By experimental error we refer to sources of variation and noise that are due to things such as mislabelling of samples (cases labelled as controls and vice versa), or measuring the same sample several times unintentionally. Some of these errors can be detected during the analysis, in the quality control stage (for example, detecting duplicated samples); however, others, such as mislabelled phenotypes, may be harder to detect and should be avoided in the first place.

• Annotation variability Many experiments depend on some external annotation of the sample, for example, whether the sample comes from a cancer or normal patient. Some of these annotations can be variable, especially for annotations that are not experimentally measured but depend on a clinician's subjective assessment. In such a case, this variability can be reduced by careful planning and documentation of the assessment procedure (the criteria for assigning the patient to a specific class), and by using a variety of independent assessors.

In contrast, intrinsic factors are ones that are potentially biologically meaningful and which we may wish to model explicitly in their own right, once the external factors have been accounted for. Such intrinsic factors include:

• Dynamic range Different genes perform different roles in the cell, and this dictates the range of expression levels they can take. In particular, mRNA from genes coding for transcription factors is known to exhibit a small dynamic range, making it difficult to detect the differential expression signal over the background noise.


• Tissue specificity Gene expression for some genes is highly dependent on the tissue type, whereas so-called "house-keeping" genes are active in most tissues since they are required for the basic functioning of the cell. Therefore, gene expression experiments that assay the wrong tissue may fail to detect differential gene expression that is present. The physical distance between the "right" and "wrong" tissue might be very small, leading to situations where both types of tissue are assayed together and the resulting sample is a heterogeneous mixture of many cell types, potentially reducing our ability to detect subtle changes in specific tissue types.

• Time specificity Gene expression exhibits strong time dependence, on several time scales. For example, genes in certain bacteria respond to a change in the lactose levels in the environment by expressing genes responsible for producing the lactase enzyme. The expression of these genes reaches a steady state over a timescale of minutes to hours. Once the lactose has been metabolised, the genes will stop being transcribed and expression will gradually decrease as the mRNA decays. In contrast, other genes, responsible for embryogenesis, may only be active during that stage of development. On yet another timescale, genes related to the circadian cycle show cyclic patterns of expression over the hours of the day. These differing dynamic patterns raise two issues. First, is the experiment capturing the entire pattern? Most microarray experiments are snapshots in time. Those that are taken across time (time course experiments) tend to be smaller due to the extra effort in measuring more samples. Measuring expression over time is not practical for certain experiments, such as those relying on human biopsies. Second, there is the issue of sampling frequency — is the experiment precise enough to measure the events of interest? If the time-course experiment is spaced at intervals too far apart, it will fail to capture the higher frequency changes, which may be biologically relevant. In practice, except for time course experiments, many gene expression experiments only capture a snapshot in time, when most gene expression has already reached steady state.

• Heterogeneous samples Phenotypes that superficially appear the same may have different underlying causes. For example, breast cancer has been shown to be a heterogeneous disease, driven by different cellular mechanisms and exhibiting phenotypes such as different degrees of aggressiveness and response to treatment (Loi et al., 2007; Perou et al., 2000; Sørlie et al., 2001; Sotiriou et al., 2006). On the one hand, by analysing such sub-populations together, we may increase the statistical power (the probability of detecting a true association) for detecting common genomic mechanisms, at the expense of detecting those mechanisms that are distinct. On the other hand, analysing them separately may result in sample sizes that are so small that they reduce the power to detect even the common causes.
This is a consequence of the well-known statistical bias-variance trade-off (Hastie et al., 2009a).


• Post-translational modifications A post-translational modification is a change to the protein after it was produced by the ribosome. Modifications include events such as addition and removal of amino acids, phosphorylation, methylation, and acetylation, which affect protein activation, localisation (which part of the cell the protein will be transported to), degradation, and ability to interact with other proteins (Mann and Jensen, 2003). This phenomenon again demonstrates that expression levels, measured as mRNA levels, may not directly correspond to protein levels.

• Alternative splicing In alternative splicing, the pre-mRNA exons transcribed from a gene are spliced together in different ways, producing different mRNA molecules and eventually different proteins called isoforms (Blencowe, 2006). Each isoform may function differently in the cell, and the production of each isoform can be tissue dependent. This phenomenon further weakens the assumption that one gene codes for one protein, and that the gene can be assayed by one probeset — generally, each isoform requires its own probeset. When several isoforms from the same gene are measured, they can provide conflicting evidence regarding the expression of the gene, unless accounted for. Far from being a rare phenomenon, alternative splicing is estimated to occur in about half of all human genes (Modrek and Lee, 2002).

• MicroRNA MicroRNAs are a class of small RNA molecules that bind to mRNA and degrade it, thus decreasing gene expression. MicroRNAs are an important part of the cell's protein regulation systems (Baek et al., 2008; Bartel, 2004); currently, there are several hundred known variants, and they occur commonly in the cell. As with alternative splicing, microRNA shows that while a gene can be "active", in the sense that it is being transcribed into mRNA, this does not necessarily translate to protein production. Moreover, gene regulatory interactions are not limited to protein coding genes — our mental model of regulation through transcription factors must be expanded to include microRNAs as well.

• Non-transcriptional regulation One of the most commonly studied mechanisms of gene regulation is transcriptional regulation: the process in which a gene that codes for a transcription factor is expressed, and the factor then modulates the expression of another gene. However, aside from transcriptional regulation there are at least two other major cellular regulatory systems (Alon, 2007), signalling regulation and metabolite regulation. In signalling regulation, also described by protein-protein interaction (PPI) networks, proteins interact with one another, performing tasks such as relaying information to the cell about its environment. Current knowledge of signalling regulation mechanisms is relatively sparse, partly because these networks operate on much shorter timescales than gene expression and because protein levels are harder to assay on a large scale compared with gene expression. Metabolite regulation is the regulation of gene expression by


metabolites inside and outside the cell, as in the bacterial lactose response discussed previously. Since many transcriptomic analyses examine gene expression in isolation from proteomic and metabolomic data, they produce an incomplete picture of the cell's activities. Some recent examples of studies that integrate several data sources, such as gene expression, metabolites, and genetic data, include Chen et al. (2008) and Inouye et al. (2010a).

• Pathways and topology Apart from the distinction between regulatory, signalling, and metabolic pathways, there is the issue of pathway topology, that is, the network structure. Gene networks can have different topologies, depending on their cellular role (Alon, 2007). The observed complexity of gene networks is partly driven by evolutionary selective pressure towards genetic buffering — stability of the phenotype in the face of mutations (Moore, 2005). By not relying on any one gene for its operation, the system is more resilient to potential damage. While buffering confers evolutionary advantages, it makes the analysis of transcriptomic data more difficult, since it is harder to separate the contributions of each gene to the phenotype — a perturbation to the expression of one gene may be insufficient to affect the phenotype. Moreover, some of the marginal contributions may be weak in themselves but constitute part of a larger epistatic mechanism. In Chapter 4 we discuss several approaches for analysing transcriptomic data from the pathway perspective.

Figure 2.3.: A revised Central Dogma of molecular biology (nodes: epigenetics, DNA, mRNA, microRNA, protein, metabolite, clinical phenotype). We distinguish between a clinical phenotype, which is a high level phenotype such as case/control status, and other phenotypes — many of the other nodes can be considered phenotypes in their own right, such as gene expression (mRNA) in eQTL studies.

Having surveyed some of the biological phenomena that indicate that our picture of gene → mRNA → protein is incomplete, we may consider a revised Central Dogma, taking into


account the factors we have discussed: DNA, mRNA, microRNA, proteins, epigenetics (discussed in Section 2.4.6), and metabolites, as shown in Figure 2.3. The issue of what constitutes a phenotype is subjective; we may consider some high level manifestation of disease as the phenotype, or we may consider protein or metabolite concentrations as phenotypes as well. In Section 2.4.8 we discuss expression quantitative trait loci (eQTL) studies, where gene expression itself is considered a phenotype driven by genetic factors, and other external phenotypes are considered to be downstream of the genes.

Despite the limitations of gene expression experiments, and notwithstanding the fact that gene expression is only a partial description of cellular activity, gene expression microarrays have been a mainstay of modern molecular biology for three main reasons. First, gene expression microarrays enable us to assay tens of thousands of probesets simultaneously, and gain insight into the important mechanism of gene expression. Second, their low cost compared with alternative technologies such as RNA-seq (Wang et al., 2009) and proteomic methods has made gene expression arrays attractive to researchers. Third, analysis of gene expression data has become relatively routine, and generally does not require specialised computing resources once the data have been preprocessed, in contrast with sequencing data, which must undergo extensive postprocessing such as alignment and assembly.

2.4. The Genetic Basis of Disease

Genetics is concerned with inheritance of traits (phenotypes) — questions such as which phenotypes are heritable, how heritable they are (the genetic component of the observed variability), what the mechanisms of heritability are, and which mutations are responsible for important traits such as disease. DNA, ignoring for the moment the role of epigenetics, is the major basis for heredity, being passed from parents to child. One important vehicle of heritable disease is the single-nucleotide polymorphism (SNP), which is a population-level variation in one DNA base. (A SNP is usually referred to as a variant rather than a mutation, since we do not know which of the variants is the wild type and which is the mutant.) How common a SNP is in the population is measured by its minor allele frequency (MAF). Typically, only SNPs that are common enough in the population are assayed on microarrays, as SNPs with low MAF require large sample sizes in order to be genotyped confidently. To understand the importance of SNPs in disease, we must first understand some basic genetic facts.

In diploid organisms, such as humans, there are two copies of each chromosome, one from each parent (except for the sex chromosomes). Each chromosome in the pair is said to provide one allele — one of the bases A, G, C, and T. For our purposes here, the actual base does not matter.
Typically, we deal with biallelic variants, therefore one allele is arbitrarily denoted as "A" and the other as "B" (the labels themselves do not imply any ordering of the alleles). Taken together, each individual has at one locus (DNA position) a pair of alleles, out of the three possible combinations — AA (homozygous for A), AB (heterozygous), or BB (homozygous for B).


Figure 2.4.: Different phenotypes are characterised by different combinations of variant frequency and strength of genetic effect (odds ratio). Reprinted by permission from Macmillan Publishers Ltd (Manolio et al., 2009), copyright (2009).

Here, we do not distinguish between AB and BA because these two genotypes are mostly functionally identical — it does not matter which allele came from which parent, only whether the offspring actually has the allele or not. The process of determining the source of each allele (mother or father) is known as phasing.

Genetic traits are roughly divided into two classes: Mendelian traits and multifactorial traits. In Mendelian traits, a single SNP may be sufficient to trigger disease. Mendelian diseases tend to be rare, but severe in their effect.
For example, cystic fibrosis (CF) 2 is an autosomal recessive disease caused by a mutation in the CFTR gene on chromosome 7, with an estimated population prevalence of about 1 in 1972 births (Scotet et al., 2003). CF is Mendelian and recessive, and having two copies of a defective CFTR gene guarantees that the disease will develop. In contrast, many diseases with known genetic components, such as some types of cancer, hypertension, Type 1 and Type 2 diabetes, and celiac disease, are multifactorial, in that they are thought to depend on a relatively large number of SNPs. Multifactorial diseases tend to be more common than Mendelian diseases — for example, the estimated prevalence is 1% for celiac disease, and for hypertension in the USA it is 7.3–66.3% depending on age (Ong et al., 2007), although for T1D the prevalence is only about 0.3%.

Hypotheses on the genetic architecture of multifactorial disease are based on the evolutionary principle that strong SNPs tend to be rare — as with Mendelian disease — whereas weak SNPs tend to be common.

2 http://www.ncbi.nlm.nih.gov/omim/219700


The reason for the difference in frequency is negative selection pressure. The marginal contribution of each SNP to the individual's fitness influences the degree of selection against it. Lethal traits are strongly selected against, especially when they affect an individual before reproductive age, whereas weakly-acting (low penetrance) SNPs that affect the individual later in life incur less negative selection pressure. The first major hypothesis for the architecture of common disease is that there are many weak SNPs contributing to the disease — the common disease common variant (CDCV) hypothesis (Bodmer and Bonilla, 2008; Lohmueller et al., 2003; Pritchard and Cox, 2002; Reich and Lander, 2001). A competing hypothesis — common disease rare variant (CDRV) — assumes that common disease is caused by a small number of rare but strong SNPs. In practice, there is likely to be a continuum of both allelic frequency and the size of the variant's effect (Manolio et al., 2009), as illustrated by Figure 2.4, and both hypotheses probably hold true to different extents in different traits (Schork et al., 2009). Genome-wide association studies (GWAS), which aim to find strong SNP associations with phenotypes by examining hundreds of thousands of known SNPs (see Section 2.4.5), are premised on the CDCV assumption, since they largely examine variants that were a priori known to be common or that could be confidently declared as SNPs from sequencing data. These constraints impose a lower limit on allele frequencies that are considered to be SNPs, usually around 1% MAF.

2.4.1. Linkage Disequilibrium

Linkage disequilibrium (LD) is the phenomenon in which regions of DNA that are physically close to each other tend to be more highly correlated in terms of their genotype than regions that are far apart. LD can be explained by genetic recombination. Recombination is the result of meiosis, the process in which the DNA in a diploid organism is split between its haploid gametes (sperm for males and eggs for females), and later recombined during fertilisation and the creation of the diploid offspring. Since, to a first approximation, recombination occurs uniformly across the DNA 3, loci that are close to each other have a higher probability of being inherited together than loci that are further apart. LD thus manifests itself through blocks of highly correlated SNPs, as shown in Figure 2.5. A set of SNPs commonly inherited together on the same chromosome is called a haplotype. LD has implications both for analysis of SNP data and for biological interpretation; see Chapter 3 for discussion.

LD is estimated from the data, using several alternative approaches. Assume we have two SNPs, with two alleles each, 'A'/'a' and 'B'/'b', respectively.
The joint probabilities of observing the combinations of these two alleles are

    p_AB := P(A, B),   p_Ab := P(A, b),   p_aB := P(a, B),   p_ab := P(a, b).

3 Recombination is now known to occur in hotspots and not uniformly across the DNA (Myers et al., 2005).


Figure 2.5.: Association for SNPs in chromosome 13 (q22.1) with pancreatic cancer. The diamonds and squares represent the log10 p-values for association of SNPs. Overlaid are the recombination rates (centimorgans per megabase). On the bottom is a heatmap showing the LD between the SNPs, measured by r². Reprinted by permission from Macmillan Publishers Ltd (Petersen et al., 2010), copyright (2010).

The two SNPs are said to be in linkage equilibrium when they are independent: the joint probabilities p_AB, ..., p_ab factor as products of the marginal probabilities, namely p_AB = p_A p_B, p_Ab = p_A p_b, p_aB = p_a p_B, and p_ab = p_a p_b, where p_A, p_a, p_B, and p_b are the probabilities of observing allele 'A' at the first SNP, allele 'a' at the first SNP, allele 'B' at the second SNP, and allele 'b' at the second SNP, respectively.

Since these probabilities are never directly observed, but are instead estimated from allele frequencies in finite data, there may be some random fluctuations from perfect equilibrium.
One measure of this deviation is D, defined as

    D = p_AB − p_A p_B,    (2.1)

where p_A and p_B are estimated from the observed allele frequencies. Estimating p_AB is


less straightforward, since this is the probability of the AB haplotype, which is not generally known in population studies as it depends on the phase: the haplotypes AB/ab (one on each chromosome of the chromosome pair) cannot be distinguished from the haplotypes aB/Ab without knowing which allele came from the father and which came from the mother. Therefore, estimating p_AB requires either phasing the genotypes or more sophisticated estimation approaches such as Expectation-Maximisation (Foulkes, 2009). A value of D = 0 represents perfect equilibrium, and non-zero values represent deviations from equilibrium.

One drawback of the D measure is that its range depends on the allele frequencies, so it cannot be meaningfully compared across SNPs with different frequencies. Another statistic is thus D′, defined as

    D′ = |D| / D_max,    (2.2)

where

    D_max = min{p_A p_b, p_a p_B}   if D > 0,
    D_max = min{p_A p_B, p_a p_b}   if D < 0.

The D′ statistic is scaled such that 0 ≤ D′ ≤ 1, where a value of 0 represents equilibrium and a value of 1 is called complete LD.

Another related measure of LD is r², which is equivalent to the squared Pearson correlation between the genotypes, and can be expressed as

    r² = D² / (p_A p_B p_a p_b),    (2.3)

where 0 ≤ r² ≤ 1. A value of r² = 1 is called perfect LD, indicating that the two SNPs have the same genotypes in the samples tested.

2.4.2. Hardy-Weinberg Equilibrium

Briefly, the Hardy-Weinberg principle states that in a randomly-mating population (where every individual has the same probability of mating with any other individual), free of selection, mutation, and migration, the genotype frequencies at a given biallelic locus follow a binomial distribution that is a function of the allele frequencies (Falconer and Mackay, 1996; Hedrick, 2009), and the allele frequencies will be fixed between generations. In other words, given two parental alleles A and B with respective frequencies p and q (where p + q = 1), the offspring genotype frequencies for the three possible genotypes are p² (homozygote for A), 2pq (heterozygote), and q² (homozygote for B). The locus is then said to be in Hardy-Weinberg equilibrium (HWE). Note that HWE applies to each locus separately, as different loci can be under different selection pressures and different mutation rates.

The Hardy-Weinberg principle is useful in case/control genome-wide association studies (GWAS, see Section 2.4.5), as significant deviations of the genotype frequencies from HWE (in the controls) may indicate genotyping errors.
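To make these quantities concrete, the following minimal sketch (in Python; the haplotype and allele frequencies are invented purely for illustration and not taken from any study) computes D, D′, and r² for a pair of biallelic SNPs, and the expected genotype frequencies at a single locus under HWE.

    # Illustrative haplotype frequencies; in practice these would be
    # estimated from phased genotypes or via EM, not specified by hand.
    p_AB, p_Ab, p_aB, p_ab = 0.50, 0.10, 0.15, 0.25   # must sum to 1

    p_A = p_AB + p_Ab              # marginal frequency of allele A at SNP 1
    p_B = p_AB + p_aB              # marginal frequency of allele B at SNP 2
    p_a, p_b = 1 - p_A, 1 - p_B

    D = p_AB - p_A * p_B                                         # Eqn (2.1)
    D_max = min(p_A * p_b, p_a * p_B) if D > 0 else min(p_A * p_B, p_a * p_b)
    D_prime = abs(D) / D_max                                     # Eqn (2.2)
    r2 = D ** 2 / (p_A * p_B * p_a * p_b)                        # Eqn (2.3)
    print(D, D_prime, r2)

    # Expected genotype frequencies under HWE at one locus, given allele
    # frequencies p (allele A) and q = 1 - p (allele B).
    p, q = 0.7, 0.3
    print(p ** 2, 2 * p * q, q ** 2)    # expected AA, AB, BB frequencies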


2.4.3. SNP Microarray Technology

SNP microarrays are similar to gene expression microarrays in that they measure thousands of probes simultaneously. The main difference, however, is that they are not used for measuring expression through binding of mRNA, but rather for measuring the binding intensity of DNA itself, for each allele of each SNP. Two of the major manufacturers of commercial SNP arrays are Affymetrix (Santa Clara, CA, USA) and Illumina (San Diego, CA, USA).

Modern SNP arrays measure between 500,000 and over 1 million SNPs at a time. The SNPs represented on the arrays were chosen to represent (tag) most of the variants identified by the HapMap 4 project (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2005, 2007), which has sought to characterise the genetic variability in several human populations. As of HapMap3, there are more than 3 million known genetic variants, and 1000Genomes (1000 Genomes Project Consortium, 2010) 5 has around 38 million SNPs as of early 2012; currently, they cannot all be measured on a SNP array. However, this is not a major impediment due to LD — unobserved SNPs exhibiting high LD with observed SNPs can be imputed based on the observed ones, thus increasing the effective number of SNPs on the array. Depending on the quality of imputation, the imputed SNPs can be tested for association with the phenotype as if they were measured on the array in the first place (see Section 2.4.4).

The basic steps in a SNP array experiment are (LaFramboise, 2009):

• DNA extraction If we are after germline (non-somatic) variations, then many body tissues are suitable since their DNA is identical. Blood samples or cheek swabs are common sources of cells from which DNA can be extracted.

• Hybridisation and washing DNA is hybridised to the array, which contains short probes designed to be complementary to DNA fragments containing known SNPs. Each allele is measured by a separate probe. As with gene expression arrays, earlier Affymetrix SNP arrays had perfect match (PM) and mismatch (MM) probes, intended to measure non-specific binding for later correction, whereas more recent Affymetrix arrays, such as the Human SNP Array 6.0, use PM probes only. For Illumina arrays, there is only one probe for each allele of each SNP. After hybridisation, the remaining unbound material is washed away.

• Measurement As with gene expression microarrays, SNP arrays are laser scanned to produce a digital representation of the signal intensities.

4 http://www.hapmap.org
5 http://www.1000genomes.org


• Preprocessing Each allele may have multiple probes, and their individual binding intensities are statistically summarised to form the signal intensity for the allele. From the ratios of the signal at the two alleles at each locus, and the identity of the probes (matching SNPs with A, T, G, or C), the discrete genotypes (AA, AB, BB) are inferred. Genotype calling, as this process is called, is performed for Affymetrix arrays by tools such as RLMM (Rabbee and Speed, 2006) and the subsequent BRLMM (Affymetrix, Inc., 2006), CHIAMO (Marchini et al., 2007), and Birdseed (Korn et al., 2008), and for Illumina arrays by methods such as Illuminus (Teo et al., 2007). The basic principle behind genotype calling, shown in Figure 2.6, is clustering of the samples into three groups: homozygous for A, heterozygous (AB), and homozygous for B. The genotype calling methods differ in their statistical assumptions, such as the distribution of the samples in each cluster. In the process of genotype calling, SNPs with ambiguous genotypes are discarded. Typically, SNPs are also filtered based on MAF ≥ 1% (since there is often not enough statistical power to confidently detect rarer variants) and a statistical test for Hardy-Weinberg equilibrium, which tests for the statistical non-independence of the two alleles, a useful indicator of genotyping errors.

• Analysis Once the data have been preprocessed and the genotypes have been called, analysis of the data can begin. Below, we discuss genome-wide association studies, and in Chapter 3 we discuss some of the statistical and computational principles behind the analysis of genetic data.

2.4.4. Imputation

Roughly speaking, imputation refers to the process of "filling in" unobserved genotypes in a dataset, leveraging haplotype patterns in known data to infer the genotypes of the unknown SNPs in our dataset of interest. These genotypes may be unobserved since they were not assayed in the first place, or they may contain missing values representing the situation where the genotype calls could not be made with sufficient confidence (low posterior probability).

The haplotype information is available in the form of a reference panel, such as those available from HapMap (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2005, 2007) and the 1000Genomes project (1000 Genomes Project Consortium, 2010). These panels are densely genotyped, often containing several million SNPs, and their phase is known (hence the haplotypes are known). Apart from considering the correlations, there are also advantages in considering distance and recombination rates, so as to discount the effect of SNPs that are further away from the SNP being imputed (Marchini et al., 2007).

Once the missing genotypes have been imputed, and the quality of the imputation has been verified, they can be treated like any assayed SNP and tested for association with phenotypes. This is especially useful since most current genome-wide studies assay far fewer SNPs than are available in the reference panel, but the unassayed SNPs can be imputed and assessed as well.
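Widely used imputation methods are typically based on hidden Markov models over reference haplotypes, which is well beyond the scope of a short example. The deliberately simplified sketch below (Python with NumPy; the genotype vectors are invented, and the "most common combination" rule is only a stand-in for proper haplotype modelling) illustrates the core intuition of borrowing information from a correlated, observed SNP to fill in missing genotype calls.

    import numpy as np

    # Rows are samples; columns are two nearby SNPs coded as minor-allele
    # dosage {0, 1, 2}; -1 marks a missing genotype call. Invented data.
    G = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [2, 2],
                  [2, 2], [1, 1], [0, 0], [2, -1], [1, -1]])

    tag, target = G[:, 0], G[:, 1]
    complete = target != -1

    # For each tag genotype, record the most frequent target genotype among
    # samples with complete data (a crude proxy for using haplotype
    # frequencies from a phased reference panel).
    lookup = {}
    for g in (0, 1, 2):
        obs = target[complete & (tag == g)]
        if obs.size:
            lookup[g] = int(np.bincount(obs).argmax())

    imputed = target.copy()
    overall_mode = int(np.bincount(target[complete]).argmax())
    for i in np.where(~complete)[0]:
        imputed[i] = lookup.get(tag[i], overall_mode)

    print(imputed)    # missing entries filled in from the tag SNP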


Figure 2.6.: Measured intensities for the two alleles of one locus, over all samples in the 1958 birth cohort of the WTCCC data (The Wellcome Trust Case Control Consortium, 2007). The samples coloured red, green, and blue are called as BB, AB, and AA, respectively. The light blue colour represents missing calls (CHIAMO calls made with posterior probability < 0.9). The left and right panels show the genotype calls before and after imputing the missing calls, respectively. Reprinted by permission from Macmillan Publishers Ltd (Marchini et al., 2007), copyright (2007).

Examples of widely used imputation tools include IMPUTE2 (Howie et al., 2009, 2011), MACH (Li et al., 2010), and BEAGLE (Browning and Browning, 2007).

2.4.5. Genome-Wide Association Studies
Genome-wide association studies (GWAS) seek to find SNPs that are statistically significantly associated with phenotype, with the aim of detecting the genetic basis for various observed traits such as disease. Prior to GWAS, the most common type of study in genetics was the linkage study, where heritable traits were studied along family trees. In contrast, GWAS is usually based on large random samples from unrelated individuals. The traits under consideration in GWAS are either continuous, such as height or weight, or binary, such as case/control disease status. The basic GWAS analysis is to test each SNP individually for association with phenotype, and to discard SNPs that do not pass a stringent p-value cutoff (see Chapter 3 for details). The remaining SNPs are typically tested again for association in an independent validation dataset, and if they are still highly significant then it is taken as strong evidence for true association.
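As a concrete illustration of this single-SNP testing loop, the sketch below (Python with NumPy and SciPy; the genotypes and phenotype are randomly generated, so no real association is expected, and the sizes are toy values) applies a chi-square test to the 2 × 3 table of case/control status against genotype class for each SNP in turn. Real GWAS pipelines typically use trend tests or logistic regression with covariates, but the overall structure — one marginal test per SNP followed by a p-value threshold — is the same.

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)
    n, p = 2000, 100                        # samples, SNPs (toy sizes)
    X = rng.integers(0, 3, size=(n, p))     # genotypes as minor-allele dosage {0,1,2}
    y = rng.integers(0, 2, size=n)          # 0 = control, 1 = case

    pvals = np.empty(p)
    for j in range(p):
        # 2 x 3 contingency table: phenotype status by genotype class
        table = np.array([[np.sum((y == s) & (X[:, j] == g)) for g in (0, 1, 2)]
                          for s in (0, 1)])
        pvals[j] = chi2_contingency(table)[1]

    print(np.where(pvals < 1e-5)[0])        # SNPs passing an (arbitrary) cutoff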


Figure 2.7.: The family-wise Type 1 error rate for k independent tests. The per-test threshold is α = 0.05.


In the genome-wide context, statistical significance must account for the problem of multiple testing: the probability of a false positive (type 1 error) increases rapidly with the number of tests. Assuming k independent tests, each at a nominal threshold of α, the probability of at least one test being a false positive is

    P(at least one false positive) = 1 − (1 − α)^k,

as shown in Figure 2.7. This potentially leads to spurious associations being found when large numbers of SNPs are tested at an otherwise nominally significant threshold such as P ≤ 0.05. Corrections for the multiple testing issue include variants of family-wise error rate corrections and false discovery rate (FDR) corrections (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003). The simplest is the Bonferroni correction, which controls the family-wise error rate, that is, the probability of at least one error over all tests, by setting the corrected significance threshold for k tests to α/k, for a given nominal significance level α. In genome-wide studies, genome-wide significance is usually taken to be P < 5 × 10⁻⁸, which is equivalent to a Bonferroni correction for one million tests with a per-test α = 0.05; significance must be more stringent when more SNPs are tested.
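The short calculation below (Python; the values of k are arbitrary) reproduces the family-wise error rate formula above and the corresponding Bonferroni-corrected per-test threshold, showing why the per-test significance level must tighten as the number of tested SNPs grows.

    alpha = 0.05
    for k in (1, 10, 1_000, 1_000_000):
        fwer = 1 - (1 - alpha) ** k       # P(at least one false positive)
        bonferroni = alpha / k            # corrected per-test threshold
        print(k, round(fwer, 4), bonferroni)

    # For one million tests the Bonferroni threshold is 5e-08, the value
    # conventionally taken as genome-wide significance.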


The completion of the human genome project (The International Human Genome Mapping Consortium, 2001; Venter et al., 2001) and improvements in sequencing technologies have led to projects such as HapMap (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2005, 2007) and 1000Genomes (1000 Genomes Project Consortium, 2010), characterising DNA variation across human populations. Together with better identification of variants, the development of SNP arrays has allowed GWAS to become a common method for analysing genetic associations with phenotype — as of early 2011, the National Human Genome Research Institute Catalog of Published Genome-Wide Association Studies (NHGRI GWAS) (Hindorff et al., 2009) 6 lists more than 4000 significant SNP-phenotype associations curated from almost 750 studies, covering almost 450 phenotypes. We discuss the methods used in GWAS in Chapter 3.

6 http://www.genome.gov/gwastudies/

2.4.6. Challenges in Analysis of SNP Microarrays

As with gene expression experiments, there are challenges and obstacles with SNP array experiments and GWAS. The issues of measurement noise, sample preparation and handling, batch effects, and non-specific binding are similar to those in gene expression arrays. Issues specific to SNP arrays and GWAS in general include:

• Population structure SNPs are inherited, and may vary greatly across different human populations. The inclusion of different populations in one study, without appropriate correction, may lead to spurious associations (false positives). To mitigate these effects, GWAS are typically limited to one homogeneous population, or statistical methods such as Principal Component Analysis are used to correct for population structure prior to analysis (Price et al., 2006).

• Non-random sampling GWAS assumes random samples from the population. Including related samples, such as samples from siblings, can potentially confound the association between genotype and phenotype, since siblings are more likely to experience common environmental factors that affect the phenotype. Unless accounted for in the analysis, duplicated or potentially related samples should be detected and removed beforehand.

• Ambiguous calls Unlike gene expression, which is inherently continuous, the SNP array signal is a continuous representation of an underlying discrete allele, and the allele is determined through genotype calling. The genotype calling process is statistical and depends on certain assumptions, which differ between array platforms. Therefore, it is common to discard genotype calls below a confidence of 0.9 (posterior probability) and mark them as "no calls". Samples with many such low-confidence calls should be discarded; however, when the proportion of missing calls is not too large it may be possible to statistically impute them from the high-confidence calls.

• Sample contamination The DNA samples collected from individuals may be contaminated, either with DNA from other individuals, or with DNA from other organisms such as bacteria (Bahlo et al., 2010). The former is less of a concern in clinical studies, but is more of an issue for samples taken from residual tissue such as forensic samples.

• Sample mix-up Human error in collecting and handling the samples may create situations where some samples are mis-labelled, and the DNA does not match the correct person, potentially flipping the phenotype status in case/control studies.

• The sex chromosomes Humans have 22 pairs of autosomal chromosomes and one pair of sex chromosomes, where males have XY and females XX. The female X chromosome undergoes a process called X-inactivation, whereby only one X chromosome of the pair is active in each cell, and the choice of which chromosome gets silenced is random. Thus a risk allele on a female X chromosome contributes less risk than an equivalent allele on a male X chromosome. Therefore, the sex chromosomes are typically excluded from GWAS, unless specialised approaches are used (Clayton, 2009b).

• Epigenetics Epigenetics refers to heritable mechanisms that are not caused by changes to the DNA sequence itself. Such mechanisms include methylation of the DNA (Eckhardt et al., 2006) and histone modifications (Feinberg, 2007; Goldberg et al., 2007), which modulate the ability of genes to be transcribed and thus affect gene expression in the cell. Epigenetic marks are not detected by SNP arrays and can potentially confound GWAS results.

• Computational challenges Gene expression experiments are limited to tens of thousands of probes and typically include on the order of hundreds to a few thousand samples.


Therefore, most analyses of transcriptomic data have been computationally feasible, even on commodity computing hardware. In contrast, SNP data is much larger — datasets commonly include upwards of 500,000 SNPs, and usually several thousand samples. Meta-analyses, in which the results of several studies are combined to increase statistical power, are larger yet — Zeggini et al. (2008) analysed Type 2 diabetes in 2.2 million genotyped and imputed SNPs over more than 10,000 samples, collected from three studies. The size of the data means that standard approaches for analysing the data, such as fitting of statistical models, are not practical from either a time or space complexity perspective. We discuss this issue further in Chapter 3; for now, we mention two main characteristics of genetic data that make it amenable to analysis even at large data sizes. The first is the assumption of sparsity, which means that we expect the vast majority of SNPs not to be causal drivers of the phenotype. By taking advantage of statistical models that are based on this assumption, we can efficiently fit sparse statistical models to large data. The second characteristic is the discreteness of the genotypes, which are typically represented in terms of dosage of the minor allele {0, 1, 2} (see Section 3.4.1 for discussion of genotype coding). The discrete nature of the data allows us to compress or otherwise encode the data so as to reduce computational space requirements, and to accelerate otherwise costly numerical computation; a small sketch of one such encoding follows this list.
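As an illustration of the kind of encoding this discreteness permits, the sketch below (Python with NumPy; the packing scheme and the example genotype vector are purely illustrative and not a description of any particular package) stores four {0, 1, 2} dosage values per byte using two bits each, roughly a four-fold saving over one byte per genotype.

    import numpy as np

    def pack_genotypes(g):
        # Pack dosage values {0,1,2} into 2 bits each, 4 genotypes per byte.
        g = np.asarray(g, dtype=np.uint8)
        n_pad = ((len(g) + 3) // 4) * 4
        padded = np.zeros(n_pad, dtype=np.uint8)
        padded[:len(g)] = g
        quads = padded.reshape(-1, 4)
        shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
        return (quads << shifts).sum(axis=1).astype(np.uint8)

    def unpack_genotypes(packed, n):
        # Recover the first n genotypes from the packed representation.
        quads = (packed[:, None] >> np.array([0, 2, 4, 6])) & 0b11
        return quads.reshape(-1)[:n]

    g = np.array([0, 1, 2, 2, 1, 0, 2])
    packed = pack_genotypes(g)
    assert np.array_equal(unpack_genotypes(packed, len(g)), g)
    print(len(g), "genotypes stored in", packed.nbytes, "bytes")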
2.4.7. The Problem of Missing Heritability

Many GWAS have found SNP-phenotype associations that are highly statistically significant, across diverse phenotypes such as various diseases, metabolite levels, height, and response to medication 7. Yet even the most significant SNPs fail to explain a large proportion of the observed variability in phenotype (Manolio et al., 2009), despite the fact that many of the phenotypes are estimated to be highly heritable, that is, the phenotype has a strong genetic component. This problem has come to be known as the problem of missing heritability. For example, a large meta-analysis of 16 studies covering more than 38,000 samples (Lindgren et al., 2009) investigated SNP associations with central adiposity and fat distribution in humans. They identified three SNPs (rs987237 with p = 4.5 × 10⁻⁹, rs7826222 with p = 1.2 × 10⁻⁸, and rs6429082 with p = 2.6 × 10⁻⁸) as highly significantly associated with these traits. Despite being highly statistically significant, the two most significant SNPs explain only a small proportion of phenotypic variance (0.05% and 0.04% for rs987237 and rs7826222, respectively), and only confer a very small effect size in absolute terms — 0.49 cm and 0.43 cm in waist circumference, respectively.

7 See http://www.genome.gov/gwastudies for a curated list of GWAS results.

Several hypotheses have been suggested as to why many GWAS fail to account for much of the observed phenotypic variability (Eichler et al., 2010). These include the issue of sample


heterogeneity, discussed in Section 2.3.2, epigenetics as discussed earlier, and the following additional reasons:

• Epistasis In a strict statistical sense, epistasis means a non-additive statistical model, that is, one where the joint effect of two SNPs on the phenotype is different from the sum of the two marginal effects of the SNPs (Clayton, 2009a; Cordell, 2009; Moore and Williams, 2009). Since most GWAS employ univariable screening of SNPs, they ignore potential interactions and associations between SNPs. It may be that some sets of SNPs have strong joint effects on the phenotype without exhibiting detectable marginal effects — these would be missed by a univariable approach, as it only considers each SNP individually.

• Rare variants Rare variants are alleles that occur with MAF below roughly 0.5%–5%, and are potentially not measured by current SNP arrays. Besides being included on the array, or at least being in strong LD with an assayed SNP, assaying rare variants requires larger sample sizes, since the rarer they are, the smaller the chance of capturing enough of them in the sample in the first place.

• Weak variants Weak variants are SNPs that contribute very little to overall disease risk, but are not necessarily rare (Manolio et al., 2009). Because they have small effect sizes, combined with hybridisation and measurement noise, weak variants may not appear statistically significant in GWAS (low signal-to-noise ratio), especially after correcting for multiple testing. Despite being individually weak, the total contribution of many weak variants may still be a strong determinant of disease risk. As with rare variants, larger studies are required to adequately filter the noise and variability and find the weak variants.

• Copy number variation Copy number variation (CNV) refers to segments of DNA that undergo duplication (Freeman et al., 2006; McCarroll and Altshuler, 2007), thus potentially increasing the cellular expression levels of the duplicated genes. Some CNVs are known to occur in cancer (somatic CNVs), whereas others are inherited. In practice, the contribution of CNVs to heritable disease is unclear — a large study of SNPs and CNVs across eight common multifactorial diseases in more than 16,000 samples recently concluded that "CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases" (Wellcome Trust Case Control Consortium, 2010). Additionally, they found that many of the CNVs were already indirectly identified by the SNP analysis.

• Overestimated heritability Heritability itself is estimated from pedigree data, such as twin studies, since these allow other potentially confounding factors, such as shared environment, to be controlled for. However, heritability may be estimated


with low precision in some cases, leading to apparent missing heritability when the estimates are high. For example, Nisticó et al. (2006) estimated the heritability of celiac disease in Italian twins as 0.57–0.87, depending on the assumed population prevalence. Heritability may even change over time and between environments (Visscher et al., 2008).

2.4.8. Expression Quantitative-Trait Loci

Genetic factors are known to be important contributors to the observed variability in gene expression (Cookson et al., 2009; Gilad et al., 2008; Jansen and Nap, 2001), via the mechanism of expression quantitative trait loci (eQTL). SNPs having significant associations with gene expression are candidates for being eQTLs. We distinguish between cis-QTLs, which are SNPs proximal to a gene that induce variation in that gene by affecting its regulatory regions (modulating the ability of transcription factors to bind), and trans-QTLs. There are different definitions of what constitutes proximity of a SNP to a gene; as a heuristic, to be considered a cis-QTL, a SNP must reside on the same chromosome as the gene, typically within 1 Mb of the gene boundaries or the transcription start site. QTLs that do not fit this definition are considered trans-QTLs. The important distinction is that cis-QTLs operate directly on the proximal gene, whereas trans-QTLs are potentially mediated by secondary transcription factors or other downstream effects. Unlike standard case/control phenotypes, which represent one phenotype, gene expression microarrays allow us to measure tens of thousands of phenotypes simultaneously.

Together with SNP data, we can search for strong associations between SNPs and gene expression, with the aim of understanding the underlying regulatory architecture of the genes — how many SNPs regulate them, how they interact (additively, dominantly, epistatically), and which genes these SNPs represent (Rockman, 2008). eQTL analyses can also be expanded to include phenotypes downstream of gene expression, such as disease state, in which case gene expression is considered an intermediate phenotype, mediating between SNPs and disease. Understanding the genetic architecture of disease can provide further insight into the driving molecular mechanisms that would not have been available from examining either gene expression or SNPs in isolation (Barrett et al., 2008; Mackay et al., 2009).

All of the issues and challenges that we have discussed previously in relation to gene expression and SNP experiments apply to eQTL studies as well. Additionally, there is the potential problem of gene expression probes overlapping SNPs, thereby introducing spurious correlations between gene expression and the SNP; however, this was not found to be a major source of bias in two studies (Doss et al., 2005; Emilsson et al., 2008).

Recent examples of large-scale eQTL studies include Göring et al. (2007), who analysed eQTLs for over 20,000 probes, and identified the gene VNN1 as strongly associated with high-density lipoprotein cholesterol (HDL-C) levels.


Later, Emilsson et al. (2008) studied human obesity and related traits such as body-mass index (BMI), using gene expression and SNP data across adipose tissue and blood; they report detecting almost 3400 gene expression traits with significant genetic associations in adipose tissue, after adjusting for factors such as sex, age, cell counts, and BMI. Veyrieras et al. (2008) mapped eQTLs in HapMap lymphoblastoid cells, estimating that most eQTL SNPs are indeed cis-QTLs, residing within or near genes, and that SNPs residing in exons are twice as likely to be eQTLs as those residing in introns.

2.5. Summary

We have surveyed some of the fundamental types of molecular biological data: gene expression and SNPs. Gene expression experiments describe aspects of intra-cellular processes, through measuring mRNA concentrations. In contrast, SNP experiments describe genetic variation that is heritable and largely fixed throughout the life of the individual. Both have provided immense insight into the cellular mechanisms underlying health and disease; however, when employing either approach their limitations must be considered as well. First, there are sources of noise and variation that can confound the results of the analysis. Second, both gene expression and SNP experiments only provide one aspect of the underlying biology. The true mechanisms are likely to involve complex interactions of different effects, such as gene expression, SNP regulation, epigenetic factors, and environmental factors. Integrative analyses that combine these data types are being pursued, with the aim of gaining a more holistic view of the molecular processes of the cell.


3. Review of the Analysis of Gene Expression and Genetic Data

3.1. Introduction

Two major efforts in the analysis of gene expression and genetic data have been the detection of predictive and causal markers of phenotype, and the prediction of phenotype based on these markers. The implicit assumption is that genes and SNPs that are highly associated with the phenotype are potentially causal 1, and if not causal then at least predictive of the phenotype. Identification of predictive markers also provides insight into the underlying biological mechanisms, and is useful for diagnostic and prognostic purposes, such as predicting metastasis in breast cancer patients (van 't Veer et al., 2002). Traditionally, gene expression data and genetic data have been analysed separately and with different methods, due to the different characteristics of the data: the genotype of an individual is a discrete entity and is largely fixed (barring somatic mutations in cancer), whereas gene expression can be highly dynamic on time scales as short as minutes. Nonetheless, some general principles of statistical inference are applicable to both. In this chapter we survey some of the major approaches in marker selection and predictive modelling, contrasting the differences between genetic and gene expression data where appropriate.

1 We distinguish between association and causality, where the former refers to a statistical relationship between two phenomena, such as a gene being highly expressed in some disease but perhaps not being the cause of the disease, and the latter refers to a gene without which disease would not occur (in the simplest setting). See Pearl (2009) for an in-depth discussion of causality.


3.2. Supervised Machine Learning

Two of the main paradigms of statistical machine learning are supervised learning and unsupervised learning. In supervised learning, we are given training data consisting of examples of inputs and outputs (also called predictors and responses, respectively), from which we seek to learn input-output relationships within the data; these relationships may be interesting in their own right, or they may be used for predicting the output given a new set of inputs. The second paradigm is unsupervised learning, where the data is not labelled (there are inputs only) and the goal is to find interesting groups or trends in the data. This thesis deals mainly with supervised methods, therefore we will not cover unsupervised methods here.

Formally, in the supervised learning setting, we are given some inputs x ∈ X and some outputs y ∈ Y, where X and Y are the spaces of all inputs and outputs, respectively, and our goal is to find some function f that maps between X and Y "well", quantified using a loss function L(ŷ, y) that maps from our predicted output ŷ and the true output y to a real positive number (the loss):

    L : Y × Y ↦ R_+.    (3.1)

Since the loss is zero only for perfect mappings ŷ = y and positive otherwise, it is conventional to minimise the loss 2; several concrete examples of loss functions will be discussed later. In order to minimise the loss, we must find the function f (also called a model), out of a set of functions H, that minimises the risk R jointly over X and Y:

    inf_{f ∈ H} R(f) = ∫_{X × Y} L(y, f(x)) dP(x, y),    (3.2)

where P is the joint distribution of X and Y. The minimum attainable risk is also called the Bayes risk, since it is equivalent to the risk from Bayes' rule when the true distribution P is known (see below for discussion of Bayes' rule).

In practice, P is usually not known or is only known approximately, hence the true risk of any given model cannot be known. Therefore, we estimate the risk on our training set consisting of N samples:

    R̂(f) = (1/N) Σ_{i=1}^{N} L(y_i, f(x_i)).    (3.3)

Moreover, usually our function f is parameterised by some weights (also called parameters or coefficients). For example, a linear model of one input x can be parameterised by β:

    y_i = x_i^T β + ε_i,    i = 1, ..., N,    (3.4)

where y_i ∈ R and x_i ∈ R^p are the ith output and inputs, respectively, β ∈ R^p is the vector of model weights (parameters), and ε_i ~ N(0, σ²) is iid Gaussian noise.

2 When the loss function is based on a probabilistic model, that is, the loss function is a probability density function or probability mass function, it is also called the (log) likelihood function, and we either maximise the likelihood or minimise the negative likelihood.
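To make Eqns (3.3) and (3.4) concrete, the small sketch below (Python with NumPy; the choice of β, noise level, and sample size are arbitrary simulation settings, and squared-error loss L(y, ŷ) = (y − ŷ)² is used) simulates data from the linear model, estimates β by minimising the empirical risk over the training samples, and reports the resulting training risk.

    import numpy as np

    rng = np.random.default_rng(1)
    N, p = 200, 5
    X = rng.normal(size=(N, p))
    beta_true = np.array([2.0, 0.0, -1.0, 0.5, 0.0])     # arbitrary "true" weights
    y = X @ beta_true + rng.normal(scale=0.5, size=N)    # data from Eqn (3.4)

    # With squared-error loss, minimising the empirical risk is ordinary
    # least squares, which has a standard closed-form / numerical solution.
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

    def empirical_risk(beta):
        # Eqn (3.3) with L(y, f(x)) = (y - x^T beta)^2
        return np.mean((y - X @ beta) ** 2)

    print(beta_hat.round(2), round(empirical_risk(beta_hat), 3))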


Figure 3.1.: An illustration of the relationship between true and empirical risk as model complexity increases. The Bayes risk is shown as constant since it assumes a fixed model complexity, which is the "correct" (but unknown) model. Empirical risk is the risk observed for a given model in a given finite dataset. On the far left-hand side, the model can be said to be underfitting, as the empirical risk is higher than the true risk. On the right-hand side, the model is overfitting, as it has lower empirical risk than the true risk.

For a given model, we then find the weights that minimise the risk in the training set. This principle is called empirical risk minimisation (ERM), and is equivalent to maximum likelihood estimation in statistics. Since sample sizes are finite, it is possible to achieve small empirical risk R̂ while still having large true risk R. This phenomenon is known as overfitting. Generally, overfitting increases as the set of functions H becomes richer (that is, as model complexity increases) and as the sample size N decreases. Intuitively, overfitting means that the function we have chosen, f′, is more complex than the true function f* and is fitting the noise (random component) in the data rather than just the signal (systematic component). Since the noise is random, overfitting will manifest itself as worse (higher) loss over new inputs than over the original data used for fitting the model.
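The behaviour sketched in Figure 3.1 is easy to reproduce numerically. In the sketch below (Python with NumPy; the data-generating function, noise level, and sample sizes are invented for illustration), polynomial models of increasing degree are fit to a small training set by least squares; the training error keeps falling as the degree grows, while the error on an independently generated test set typically starts to rise, the signature of overfitting.

    import numpy as np

    rng = np.random.default_rng(2)

    def simulate(n):
        x = rng.uniform(-1, 1, size=n)
        y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)   # toy "truth" plus noise
        return x, y

    x_train, y_train = simulate(30)
    x_test, y_test = simulate(1000)

    for degree in (1, 3, 9, 12):
        coefs = np.polyfit(x_train, y_train, degree)        # empirical risk minimisation
        train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
        test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
        print(degree, round(train_mse, 3), round(test_mse, 3))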


to restrict model complexity is to use simple models, for example only allowing linear models. However, a more flexible approach is to use regularisation (also known as penalisation), which reduces model complexity and thus limits the ability of the models to overfit. One such penalisation method is the lasso penalty, discussed in Section 3.4.3. Conversely, if model complexity is too low, that is, the function space H is heavily restricted, then we can only learn very simple associations from the data, and potentially miss out on more interesting ones that could lead to lower risk — this is known as underfitting. Another way to mitigate overfitting is to estimate the empirical risk R̂ on an independent test set, different from the one used for training, leading to the concepts of cross-validation and bootstrapping, discussed later. This relationship between empirical risk, Bayes risk, and true risk, as a function of model complexity, is illustrated in Figure 3.1.

3.3. Linear Models and Loss Functions

All the models employed in this thesis are based on linear models (Eqn 3.4), or transformations of linear models such as log-linear models. The problem of fitting linear and log-linear models to data can be cast as minimising a convex loss function. Two common loss functions are the squared loss for linear regression and the logistic loss for logistic regression, in which case the loss function corresponds to the negative log of the binomial likelihood. The squared loss function over N samples in p variables x_1, ..., x_N is

L(β₀, β) = (1/2) Σ_{i=1}^{N} (y_i − β₀ − x_i^T β)²,  (3.5)

where y_i ∈ R is the ith output, x_i ∈ R^p is the p-vector of inputs for the ith sample, β₀ is the intercept, and β ∈ R^p is a p-vector of model coefficients. Similarly, the logistic loss for binary outcomes y_i ∈ {0, 1} is [3]

L(β₀, β) = Σ_{i=1}^{N} [log(1 + exp(β₀ + x_i^T β)) − y_i(β₀ + x_i^T β)].  (3.6)

Another loss function useful in classification is the squared-hinge loss, which is equivalent to a least-squares support vector machine (SVM) with a linear kernel (Chang et al., 2008),

L(β₀, β) = (1/2) Σ_{i=1}^{N} max{0, 1 − y_i(β₀ + x_i^T β)}²,  y_i ∈ {−1, +1}.  (3.7)

[3] An equivalent formulation of the logistic regression loss is L(β₀, β) = Σ_{i=1}^{N} log(1 + exp(−y_i(β₀ + x_i^T β))) for y_i ∈ {−1, +1}.
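A minimal Python/numpy sketch of the two loss functions above; the example arrays passed to them are arbitrary and only meant to show the calling convention.

import numpy as np

def squared_loss(beta0, beta, X, y):
    # Squared loss for linear regression (Eqn 3.5)
    r = y - beta0 - X @ beta
    return 0.5 * np.sum(r ** 2)

def logistic_loss(beta0, beta, X, y):
    # Logistic loss for y in {0, 1} (Eqn 3.6): the negative binomial log-likelihood
    eta = beta0 + X @ beta
    return np.sum(np.log1p(np.exp(eta)) - y * eta)

# Example call on arbitrary data
X = np.array([[0.5, -1.0], [1.5, 0.3], [-0.2, 0.8]])
print(squared_loss(0.1, np.array([1.0, -0.5]), X, np.array([0.2, 1.4, -0.3])))
print(logistic_loss(0.1, np.array([1.0, -0.5]), X, np.array([0, 1, 1])))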


Figure 3.2.: Four loss functions for classification: 0/1 loss L(z_i) = I(z_i < 1), logistic loss L(z_i) = log(1 + exp(−z_i)), hinge loss L(z_i) = max{0, 1 − z_i}, and squared hinge loss L(z_i) = max{0, 1 − z_i}², where z_i = y_i(β₀ + x_i^T β) for linear models. Informally, for z ≥ 1 the predicted and observed classes match, sign(ŷ_i) = sign(y_i) (correct classification), and for z < 1 they do not match (mis-classification).
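The four losses in Figure 3.2 can be written directly as functions of the margin z; the short Python sketch below uses the definitions from the caption, and the grid of z values is arbitrary.

import numpy as np

def zero_one(z):       # 0/1 loss as defined in the Figure 3.2 caption: I(z < 1)
    return (z < 1).astype(float)

def logistic(z):       # log(1 + exp(-z))
    return np.log1p(np.exp(-z))

def hinge(z):          # max{0, 1 - z}
    return np.maximum(0.0, 1.0 - z)

def squared_hinge(z):  # max{0, 1 - z}^2
    return np.maximum(0.0, 1.0 - z) ** 2

z = np.linspace(-1.0, 3.0, 9)
for f in (zero_one, logistic, hinge, squared_hinge):
    print(f.__name__, np.round(f(z), 2))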


The squared hinge loss is similar to the standard hinge loss

L(β₀, β) = Σ_{i=1}^{N} max{0, 1 − y_i(β₀ + x_i^T β)},  y_i ∈ {−1, +1},  (3.8)

but the former (Eqn 3.7) is twice-differentiable whereas the latter (Eqn 3.8) is not. Twice-differentiability means that the Newton step-size used in coordinate descent (see Section 5.4.1) is the second derivative of the loss with respect to β_j, whereas to use coordinate descent with the hinge loss we must either use some pre-chosen step size, usually tuning it to achieve good convergence, or use a line search procedure. The disadvantage of the squared-hinge loss is that it is more sensitive to outliers in the data, with the loss increasing quadratically for samples as they move away from the separating hyperplane (the decision boundary between the two classes), whereas the hinge loss only increases the loss linearly.

These three loss functions for classification are shown in Figure 3.2. Also shown is the 0/1 loss L = Σ_{i=1}^{N} I(y_i(β₀ + x_i^T β) < 1), where a constant loss of 1 is incurred for incorrect classification. The 0/1 loss is non-convex and non-smooth and therefore difficult to optimise (due to the possibility of local minima and zero gradients), and the other loss functions can be considered to be convex relaxations of this loss (Bach et al., 2011), which are easier to optimise since they have a global minimum (or several equivalent minima if they are not strictly convex).

3.4. Feature Selection — Finding Predictive & Causal Markers

The term feature selection is used for the task of finding a subset of features that are strongly related to the output, while discarding irrelevant or weakly-predictive inputs. In this thesis, the term feature means either an original input variable, such as a gene or a genotype, or a derived variable after some transformation; we will use the terms feature and variable interchangeably. The implicit assumption is that only a relatively small subset of features are truly associated with the phenotype, whereas the remaining ones are spurious, resulting from random noise. The assumption of a relatively small set of relevant features is reasonable in gene expression and SNP analysis since we only expect a small proportion of the inputs to be truly causal. Even when this assumption does not strictly hold, it may still be useful to perform feature selection in order to reduce models to a manageable size, thus increasing biological interpretability, while maintaining good predictive performance. If the goal of the modelling process is strictly prediction, such as diagnostic or prognostic tests for breast cancer metastasis (van 't Veer et al., 2002), rather than investigating disease etiology, then feature selection can be used to find a small set of predictive markers, even though they may not necessarily be causally linked with disease but only statistically associated with it, as demonstrated in the toy example in Figure 3.3.
As long as these associations are robust (rather than spurious), they are useful for the task of prediction.


Figure 3.3.: A toy example of a three-gene network. If gene A is mutated and causes a downstream effect in genes B and C, then all three genes may appear to be associated with the phenotype, even though gene B is clearly non-causal.

It is convenient to group feature selection methods into three main approaches: filter methods, wrapper methods, and embedded methods (Guyon and Elisseeff, 2003).

3.4.1. Filter Methods

In the filter approach, a simple test statistic is used to pre-screen (filter) the inputs prior to fitting an overall model to the remaining data. Since gene expression is continuous whereas SNPs are discrete, different filters are used for each and we review them separately.

Filters for Gene Expression

In gene expression, some of the common filtering methods are the t-test, Pearson correlation, and the signal-to-noise ratio (SNR).

The two-sample t-test is a statistical test of differential expression for each gene between two groups, such as in case-control studies. Several variants exist, depending on whether the assumptions of equal sample sizes and equal variances for both classes are used. The simplest variant, which assumes equal sample sizes and equal variances, is

t = (x̄₁ − x̄₂) / √((s₁² + s₂²)/N),  (3.9)

where x̄_k and s_k² are the sample mean and variance for the kth group (k = 1 or k = 2), respectively, and N is the sample size. A p-value is derived by comparing the t statistic against a t-distribution with 2N − 2 degrees of freedom. More recent variants of the t-test include the moderated t-test (Smyth, 2004), an empirical Bayesian approach in which the variance estimate is pooled over the genes in order to reduce estimation variability.

Once the p-value has been computed for each gene, some cutoff is applied, such that genes with p-values below the cutoff are taken to be significantly differentially expressed.
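As an illustration, the following Python/numpy sketch computes the t statistic of Eqn 3.9 and the corresponding p-values for every gene at once; the simulated expression matrix, group sizes, and cutoff are illustrative assumptions only, and scipy is assumed to be available for the t-distribution tail probability.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative expression matrix: N samples per group, n_genes genes
N, n_genes = 20, 1000
group1 = rng.normal(size=(N, n_genes))
group2 = rng.normal(size=(N, n_genes))
group2[:, :50] += 1.0                        # the first 50 genes are differentially expressed

m1, m2 = group1.mean(axis=0), group2.mean(axis=0)
v1, v2 = group1.var(axis=0, ddof=1), group2.var(axis=0, ddof=1)
t = (m1 - m2) / np.sqrt((v1 + v2) / N)       # Eqn 3.9: equal sizes, equal variances
pvals = 2 * stats.t.sf(np.abs(t), df=2 * N - 2)

keep = np.where(pvals < 1e-3)[0]             # an arbitrary cutoff, as discussed below
print(len(keep))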


The cutoff can be arbitrary, simply to reduce the number of genes, or can be decided using corrections for multiple testing, such as the Bonferroni correction, which simply multiplies the nominal p-value by the number of tests, or by some variant of the false discovery rate (FDR) approach (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003).

In filtering with the Pearson correlation r, the correlation of each gene x with the phenotype y is computed as

r = [ (1/(N−1)) Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) ] / (s_x s_y),  (3.10)

where x̄ and ȳ are the means of the gene x and the phenotype y, respectively, and s_x and s_y are the sample standard deviations of x and y, respectively. Genes with small correlation (in absolute terms) are removed. Similarly, the SNR, as defined by Golub et al. (1999), is

SNR = (x̄₁ − x̄₂) / (s₁ + s₂),  (3.11)

where s_k is the standard deviation for the kth group. Genes with low SNR are removed. The cutoff can be determined by permutation tests, where the phenotype labels are randomly permuted multiple times but the gene expression data is held fixed. In each permutation, different cutoffs are assessed, each inducing a certain number of false positives, thus estimating the null distribution of cutoffs, that is, the distribution of cutoffs when there is no true association of the gene expression values with the phenotype. Finally, a cutoff that induced a sufficiently low false positive rate in the permuted data is applied to the original data.

Filters in Genetics

Due to the discrete nature of genetic data, different filters have been used for filtering SNPs. We first discuss filters for binary phenotypes, most commonly seen in case-control datasets.

The simplest approach is the allelic association test, which tests whether any single allele (rather than the genotype as a whole) is associated with the phenotype. This is the --assoc test used by the widely-used tool PLINK (Purcell et al., 2007) [4]. To understand the allelic test, consider the contingency table in Figure 3.4a, where the alleles are tabulated against the case-control status y. Note that the genotypes are not considered in this test; rather, each of the two alleles of a genotype is considered individually, thus doubling the sample size from N to 2N. Under the null hypothesis, both alleles appear in the same proportions between cases and controls. Deviations from the expected counts are quantified using the X² statistic summed over the 2 × 2 matrix of observed counts O

X² = Σ_{i=1}^{2} Σ_{j=1}^{2} (O_ij − E_ij)² / E_ij,  (3.12)

[4] http://pngu.mgh.harvard.edu/purcell/plink


where O_ij and E_ij are the observed and expected counts for the cell in the ith row and jth column, respectively. The expected counts are given by the product of the marginals divided by the total count n = n₁₁ + n₁₂ + n₂₁ + n₂₂ (here n = 2N alleles)

E_ij = (n_i1 + n_i2)(n_1j + n_2j) / n,  i, j = 1, 2.  (3.13)

The X² statistic is tested for significance by comparing it with the χ² distribution with one degree of freedom. Note that the allelic test depends on Hardy-Weinberg equilibrium (Section 2.4.2) (essentially, independence between the frequencies of the two alleles); otherwise it may incur an increased false positive rate.

(a) Counts
                  Cases y = 1      Controls y = 0
  Allele x = 1    n₁₁              n₁₂
  Allele x = 2    n₂₁              n₂₂

(b) Conditional probabilities Pr(y|x)
                  Cases y = 1          Controls y = 0
  Allele x = 1    n₁₁/(n₁₁ + n₁₂)      n₁₂/(n₁₁ + n₁₂)
  Allele x = 2    n₂₁/(n₂₁ + n₂₂)      n₂₂/(n₂₁ + n₂₂)

(c) Odds = Pr(y = 1|x)/Pr(y = 0|x)
  Allele x = 1    [n₁₁/(n₁₁ + n₁₂)] / [n₁₂/(n₁₁ + n₁₂)]
  Allele x = 2    [n₂₁/(n₂₁ + n₂₂)] / [n₂₂/(n₂₁ + n₂₂)]

Figure 3.4.: A contingency table of two alleles versus the case-control status, in terms of (a) counts, (b) conditional probabilities Pr(y|x), and (c) the odds.

For case-control studies, a common measure of association between a SNP and the phenotype is the odds ratio (OR), the ratio of the odds of the phenotype in one group to the odds of the phenotype in the other group, which is a common measure of the strength of the association. The odds of a binary event y ∈ {0, 1} are given by

Odds(y = 1) = Pr(y = 1)/Pr(y = 0) = Pr(y = 1)/(1 − Pr(y = 1)).  (3.14)

An odds of 1 means that both events have the same probability of occurring. An odds > 1 means that the event y = 1 has higher probability, and vice-versa. Figure 3.4b shows the same counts in the form of conditional probabilities Pr(y|x), and the odds are shown in Figure 3.4c.
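A short Python sketch of the allelic test for a single SNP, using Eqns 3.12 and 3.13, together with the odds ratio estimate that is given next in Eqn 3.15; the allele counts are made up for illustration and scipy is assumed for the χ² tail probability.

import numpy as np
from scipy import stats

# Allele-by-status counts as in Figure 3.4a (numbers are illustrative)
counts = np.array([[120.0, 80.0],    # allele x = 1: cases, controls
                   [ 60.0, 90.0]])   # allele x = 2: cases, controls

n = counts.sum()
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n   # Eqn 3.13
x2 = ((counts - expected) ** 2 / expected).sum()                  # Eqn 3.12
pval = stats.chi2.sf(x2, df=1)

odds_ratio = counts[0, 0] * counts[1, 1] / (counts[0, 1] * counts[1, 0])  # Eqn 3.15, below
print(round(x2, 2), round(pval, 4), round(odds_ratio, 2))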


The OR is estimated as

ÔR = { [n₁₁/(n₁₁ + n₁₂)] / [n₁₂/(n₁₁ + n₁₂)] } / { [n₂₁/(n₂₁ + n₂₂)] / [n₂₂/(n₂₁ + n₂₂)] } = (n₁₁ n₂₂)/(n₁₂ n₂₁).  (3.15)

An OR > 1 means that the odds of an event (such as having a disease) are higher in one group, for example in the cases, than in the other group, in this case the controls. An OR < 1 has the opposite interpretation. Note that for genetic data, the direction of the effect is arbitrary, as it depends on the coding of the alleles — which allele is used as the reference allele. Therefore, both SNPs with high ORs and SNPs with low ORs are potentially interesting. The p-value obtained from the χ² test is symmetric — an odds ratio of 2 has the same p-value as an odds ratio of 0.5.

Another test for association in case-control studies is the per-genotype test, in which the genotype (the two alleles) is considered as the unit of observation rather than each allele separately. Using genotypes rather than allele counts opens up the possibility of richer models, such as additive, dominant, recessive, and others. The model can be specified using the genotype coding. Assuming that the minor allele is denoted 'A', the three genotypes 'aa', 'Aa', and 'AA' can have the following coding schemes. Additive models (also called trend models) are coded as {0, 1, 2}, denoting the dosage of the minor allele 'A'. Dominant models assume that the effect of the minor allele is dominant, namely, the effect of one minor allele is equivalent to that of two minor alleles, and are therefore coded as {0, 1, 1}. Recessive models assume that there is no phenotypic effect unless there are two minor alleles, and are coded as {0, 0, 1}. Genotypic models do not assume any specific relationship between the number of alleles and the phenotype; rather, each genotype is treated separately (in statistics this is called a level in a factor), requiring a binary encoding such as {00, 01, 10}.

The tests for association within each model are slightly different: for the additive model, a Cochran-Armitage test for trend is performed (Clarke et al., 2011),

T² = [Σ_{i=1}^{3} w_i(n_{1i} n_{2·} − n_{2i} n_{1·})]² / { (n_{1·} n_{2·}/n) [Σ_{i=1}^{3} w_i² n_{·i}(n − n_{·i}) − 2 Σ_{i=1}^{2} Σ_{j=i+1}^{3} w_i w_j n_{·i} n_{·j}] },

where the subscript · represents summation along a row or column (n_{1·} is the total along row 1, n_{·1} is the total along column 1), and w = (w₁, w₂, w₃) is the genotype coding, (0, 1, 2) for the additive model. The T² statistic is compared to the 1-df χ² distribution to derive statistical significance. For the dominant and recessive models, a 2 × 2 contingency table can be constructed and tested using a 1-df χ²-test. For the genotypic test, a 3 × 2 contingency table is used, and tested for significance using a 2-df χ²-test.
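A sketch of the Cochran-Armitage trend test for a single SNP under additive coding, following the formula above; the genotype counts are illustrative only and scipy is assumed for the χ² p-value.

import numpy as np
from scipy import stats

# Genotype counts for one SNP: rows = cases, controls; columns = aa, Aa, AA (illustrative)
table = np.array([[200.0, 250.0, 50.0],
                  [300.0, 180.0, 20.0]])
w = np.array([0.0, 1.0, 2.0])          # additive (trend) coding

n = table.sum()
row = table.sum(axis=1)                # n_1., n_2.
col = table.sum(axis=0)                # n_.1, n_.2, n_.3

num = (w * (table[0] * row[1] - table[1] * row[0])).sum() ** 2
bracket = (w ** 2 * col * (n - col)).sum() - 2 * sum(
    w[i] * w[j] * col[i] * col[j] for i in range(2) for j in range(i + 1, 3))
T2 = num / (row[0] * row[1] / n * bracket)
pval = stats.chi2.sf(T2, df=1)
print(round(T2, 2), pval)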


The final class of tests are model-based tests, commonly including logistic and linear regression. Typically, such tests include only one SNP as a variable; however, other external covariables such as sex and age can be included (see Chapter 6 for discussion of multi-SNP models). The logistic model assumes that the log-odds of the phenotype y is a linear function of the jth genotype and other covariables, namely,

log[ Pr(y_i = 1 | x_ij) / Pr(y_i = 0 | x_ij) ] = β_{0j} + β_{1j} x_ij + α₁ z₁ + α₂ z₂ + ... + α_k z_k,  i = 1, ..., N,  (3.16)

where x_ij is the ith genotype for the jth SNP (coded using one of the additive, dominant, recessive, or other models), y_i is the ith case/control phenotype, β_{0j} and β_{1j} are the intercept and the regression coefficient for the jth genotype, respectively, and α₁, ..., α_k are optional coefficients for k external variables z such as sex and age. Similarly, for continuous phenotypes, a common model is the linear model

y_i = β_{0j} + β_{1j} x_ij + α₁ z₁ + α₂ z₂ + ... + α_k z_k + ε_ij,  ε_ij ∼ N(0, σ_j²),  i = 1, ..., N.  (3.17)

In both cases, a p-value for the association between each SNP and the phenotype is used to filter the SNPs, based on the t statistic in the linear regression case and on the approximate z-statistic (Wald test) or likelihood-ratio test in the logistic regression case.

3.4.2. Wrapper Methods

The second type of feature selection methods are wrapper methods, which are "black box" approaches, in that they attempt to explore the space of all possible feature subsets, fitting a model to each subset. The model is evaluated using some measure, such as predictive performance on a test set, which is used to guide the search towards better models. The search strategy must be chosen as well; for example, greedy forward search, in which the feature that adds the most to predictive performance is included in the model, one at a time, or greedy backwards search, where we start with all variables and then repeatedly drop the variable that degrades performance least (a sketch of greedy forward selection is given below).

The main advantage of wrapper methods over filter methods is that they fit an entire model to the data, not assuming independence of the inputs, whereas filter methods examine the association of each variable with the phenotype separately, thus implicitly assuming that the effect of one variable is independent of the effect of another, which is clearly not the case with gene expression and SNP data. With wrapper methods, correlations or other interactions between the inputs can be taken into account, which would have been missed when filtering each input separately. Another advantage is that wrapper approaches allow flexibility, in that they can be applied to any existing type of model, such as linear and logistic regression and support-vector machines.
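For concreteness, a minimal sketch of greedy forward selection wrapped around a logistic regression classifier, scoring each candidate subset by cross-validated accuracy; the simulated data and the use of scikit-learn here are illustrative assumptions, not a description of methods used later in this thesis.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))                             # illustrative inputs
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

selected, remaining, best = [], list(range(X.shape[1])), -np.inf
while remaining:
    # Score each candidate feature added to the current subset by cross-validated accuracy
    scores = [(cross_val_score(LogisticRegression(), X[:, selected + [j]], y, cv=5).mean(), j)
              for j in remaining]
    score, j = max(scores)
    if score <= best:                                      # stop once no feature helps
        break
    best = score
    selected.append(j)
    remaining.remove(j)
print(selected, round(best, 3))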


Wrapper methods have two main disadvantages. The first is that they are computationally expensive (exhaustive search is NP-hard), since the space of all feature combinations is exponentially large in the number of possible features (2^p combinations for p features). Heuristics are often used to circumvent this problem. Computational complexity is compounded by the need to fit different models many times over. This is especially a problem for large datasets, which are common in genomics. Therefore, wrapper methods have not been widely applied in this area. Second, predictive ability may not be a sensitive indicator of the importance of a variable to the model, especially when the model already includes several strong variables — the addition or removal of one variable might not change the predictive performance substantially, in which case it is not clear whether to retain the variable or to exclude it.

3.4.3. Embedded Methods

In contrast with wrapper methods, embedded methods perform feature selection as part of the model fitting process, using the weights estimated for each feature as the basis for inclusion in the final model, rather than the overall model predictive performance. Two of the best known embedded approaches are recursive feature elimination (RFE) (Guyon et al., 2002) and the lasso (Tibshirani, 1996).

Recursive Feature Elimination

RFE is conceptually similar to a wrapper method in that a model is fit to a subset of inputs and its predictive ability is then evaluated. However, the difference is that instead of performing a search in the space of all feature combinations and thus deriving an importance for each feature, the importance of each feature is derived from its (absolute) weight in the model, for example, the regression coefficient in a logistic regression. In addition, RFE starts from the full model and successively removes features. For example, in a linear regression with 100 possible features, a model is fit to all 100 features. Predictive performance (R² in this case) is used to assess the model. Then the feature with the smallest regression coefficient β (in absolute value) is removed, and the model is re-fit. This process is repeated until no features are left (to assess the contribution of each feature to the model) or alternatively until performance degrades (to find the best predictive subset). RFE can be applied to any model that maintains an internal representation of each feature's importance, such as support vector machines (with small modifications necessary for non-linear kernels), and linear and logistic regression.

As with wrapper methods, RFE can be computationally expensive since a separate model is fit to each set of features. In addition, there are many schemes for removing features, such as one at a time or several in one batch. It is not clear whether removing the features with the smallest weight in the model is always a sensible approach — for example, some features may have small weights but small variance as well, whereas others have higher weights but larger variance, in which case the features with the lower variance may be more useful to keep. Scaling the features to unit variance may help in this matter.
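A minimal sketch of RFE for linear regression as just described, removing one feature (the one with the smallest absolute coefficient) at a time; the data are simulated and the use of scikit-learn's LinearRegression is an assumption for illustration. scikit-learn also provides a ready-made implementation in sklearn.feature_selection.RFE.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=100)

features = list(range(X.shape[1]))
eliminated = []                                             # features in order of removal
while features:
    model = LinearRegression().fit(X[:, features], y)
    worst = features[int(np.argmin(np.abs(model.coef_)))]   # smallest absolute weight
    eliminated.append(worst)
    features.remove(worst)
print(eliminated[::-1])                                     # most important features were removed last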


The Lasso

The lasso (Hastie et al., 2009a; Tibshirani, 1996) is an approach to fitting models that are penalised by the sum of the absolute values of the weights, and it performs feature selection as part of the fitting.

To motivate the lasso, we first discuss the concept of penalised loss functions. As mentioned earlier, many statistical models can be expressed as the minimising solutions to a loss function. Generalised linear models (GLMs), which include linear and logistic regression, are fit using the principle of maximum likelihood, and for them the loss function is the negative log likelihood. In the case of support vector machines, the loss function is the hinge loss.

Taking linear regression as a concrete example, the loss function L over N samples and p variables is equivalent to the sum of squares of the residuals

L(x, y, β) = (1/2) Σ_{i=1}^{N} (y_i − x_i^T β)²,  (3.18)

where x_i is the p-vector of the ith sample of each input variable, y_i ∈ R is the output variable, and β ∈ R^p is the p-vector of regression coefficients (we omit the intercept here for notational convenience). The solutions that minimise the negative log likelihood are the same as the maximum likelihood estimates β*.

The lasso penalised loss is

L′ = L + λ||β||₁ = L + λ Σ_{j=1}^{p} |β_j|,  (3.19)

where λ ≥ 0 is a user-determined parameter that determines the degree of penalisation, and ||·||₁ is the ℓ₁-norm. As λ → 0, the solution β* approaches the unpenalised maximum likelihood solution. The lasso has the effect that for high values of λ, some of the variables are set exactly to zero, and the non-zero variables are shrunk towards zero.

Another common penalisation method is ℓ₂ penalisation, also called ridge regression, expressed as

L′ = L + λ||β||₂² = L + λ Σ_{j=1}^{p} β_j²,  (3.20)

where the penalty is the squared ℓ₂-norm of the coefficients. In contrast with ℓ₁ penalisation, ℓ₂ penalisation does not generally induce sparse models; that is, all coefficients in an ℓ₂-penalised model are typically non-zero (except in unrealistic scenarios with zero noise). Hence, ℓ₂ penalisation is not sufficient for feature selection, and must be augmented with another method such as RFE.

We defer discussion of how lasso models are fit to data to Chapter 6. For now, we discuss the feature selection properties of the lasso.
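The contrast between the ℓ₁ and squared ℓ₂ penalties is easy to see numerically. In the following sketch, scikit-learn's Lasso and Ridge are used as stand-ins for Eqns 3.19 and 3.20 (their penalty parameter alpha is scaled slightly differently from λ), and the sparse simulated data are illustrative only.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 50))
beta = np.zeros(50)
beta[:3] = [2.0, -1.5, 1.0]                       # only three truly non-zero coefficients
y = X @ beta + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                # alpha plays the role of lambda in Eqn 3.19
ridge = Ridge(alpha=1.0).fit(X, y)                # squared l2 penalty as in Eqn 3.20
print("lasso non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
print("ridge non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))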


Figure 3.5.: Penalised squared loss in 2 dimensions. The red contours show the curves of constant loss for different solution pairs (β₁, β₂), and β* is the unpenalised solution. Also shown are the feasible regions (in cyan) imposed by the ridge constraint β₁² + β₂² ≤ t (left), and by the lasso constraint |β₁| + |β₂| ≤ t (right). Adapted from Hastie et al. (2009a).

For an intuitive explanation of why the lasso sets some variables to zero exactly, we recast the penalty formulation (also called the Lagrangian formulation) as the equivalent constrained optimisation formulation

β̂* = arg min_β L,  subject to Σ_{j=1}^{p} |β_j| ≤ t,  (3.21)

where t ≥ 0 is the constraint on the ℓ₁-norm of the coefficients. As shown for 2 dimensions in Figure 3.5, the feasible region for the lasso solution is given by |β₁| + |β₂| ≤ t, which is a convex set (but not strictly convex). The solution to the penalised problem is at the first intersection of the non-penalised loss with the feasible region. If the solution is at a corner of the feasible region, then one of the coefficients is exactly zero. These corners occur more often with non-differentiable penalties such as the ℓ₁-norm than with differentiable penalties such as the ℓ₂-norm (Hastie et al., 2009a).

More rigorous analyses of the lasso's asymptotic behaviour (Knight and Fu, 2000; Zhao and Yu, 2006) have shown that under the condition of irrepresentability, loosely interpreted as the condition that the covariance between the relevant and irrelevant variables is not too large, and provided that the true number of relevant variables is small enough, the lasso is consistent, in the sense that as the number of samples N → ∞, the lasso recovers the true non-zero variables (the support) with probability one (sometimes called sparsistency). In practice, for a given dataset we do not know whether this condition holds, since we do not know the relevant and irrelevant variables in the first place, hence we cannot be certain that the lasso only identifies relevant variables. This problem can be mitigated by schemes such as stability selection (Meinshausen and Bühlmann, 2006), which is essentially a multiple resampling procedure for determining the relevance of each variable.
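A simplified sketch of the resampling idea behind stability selection: refit the lasso on many random subsamples and record how often each variable is selected. The subsample size, penalty, and selection-frequency threshold used here are illustrative assumptions rather than the exact procedure of Meinshausen and Bühlmann.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))
beta = np.zeros(50)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=100)

B, freq = 100, np.zeros(50)
for _ in range(B):
    idx = rng.choice(100, size=50, replace=False)       # a random half of the samples
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    freq += (fit.coef_ != 0)
freq /= B
stable = np.where(freq >= 0.8)[0]                       # selected in at least 80% of resamples
print(stable)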


The main advantages of lasso-based feature selection are that the lasso problem is a convex optimisation problem (as long as the loss function is convex as well), allowing us to use the extensive toolkit of convex optimisation to solve it efficiently. Taking advantage of the model's sparsity allows the development of fast algorithms, such as coordinate descent, discussed in Chapter 6. As the sparsity is induced by the model fitting process itself, there is no need to use external approaches such as RFE or wrappers.

Beyond the Lasso

The ℓ₁ norm is one instance of the ℓ_p norms, defined as

||β||_p = ( Σ_{j=1}^{p} |β_j|^p )^{1/p},  p ≥ 1.  (3.22)

The ℓ₁ norm is the only convex norm that induces sparse models, since it is non-differentiable at 0. However, the lasso has the sometimes undesirable side effect that non-zero variables are shrunk towards zero too much (high bias) (Zhao and Yu, 2006).

There also exist penalisation methods based on quasi-norms, where 0 < p < 1 is used [5]. In contrast with the ℓ₁-norm, norms with 0 < p < 1 induce stronger model sparsity, while shrinking the non-zero variables less. However, such norms are not convex (the problem does not have a global solution) and cannot be solved using standard convex optimisation tools.

Another improper norm is the ℓ₀-norm, which is equivalent to the number of non-zero elements in a vector

||β||₀ = Σ_{j=1}^{p} I(β_j ≠ 0),  (3.23)

where I(·) is the indicator function, 1 for true and 0 for false. The ℓ₀-norm is useful in that it allows constraining the model to a predefined number of non-zero variables. However, the ℓ₀-norm is non-convex and non-differentiable, and exactly solving an ℓ₀-norm constrained problem in p variables is equivalent to a combinatorial optimisation problem with 2^p combinations, making it NP-hard; several approximations to the ℓ₀ norm exist (Weston et al., 2003). Solving the ℓ₁-norm constrained problem can be seen as a convex relaxation: a tractable approximation of the ℓ₀-norm problem (Bach et al., 2011).

The final norm we briefly mention is the ℓ∞-norm, which is convex, and is defined as

||β||∞ = max_{j=1,...,p} |β_j|.  (3.24)

The ℓ∞-norm is useful for penalising groups of variables together.

Apart from simple norms, hybrid norms can also be used for penalisation. One such method is the elastic net (Zou and Hastie, 2005), which combines both the ℓ₁ and the ℓ₂ penalties.

[5] When p < 1, ||·||_p is not strictly a norm but a quasi-norm, since the triangle inequality ||x + y||_p ≤ ||x||_p + ||y||_p is not satisfied.


they allow a hierarchical structure (Gelman and Hill, 2007), where several layers of the data can each get their own distribution. For example, in a case-control study, we can set up two prior distributions, one for cases and one for controls, which in turn determine the parameters of the distributions of each gene.

The Bayesian approach is not without its limitations. First, when analysing data with little domain knowledge, it may be difficult to set an informative prior, and a vague (non-informative) prior is used, often chosen using cross-validation, which makes the process very similar to frequentist penalisation methods. Second, most priors used are chosen from a small set of convenient analytical forms (for example, the normal distribution for continuous variables, the Dirichlet prior for binomial variables) for which closed-form posterior distributions can be derived and solved efficiently. When analytical solutions are not available, stochastic methods such as Markov-chain Monte-Carlo (MCMC) methods are usually employed, which can be time-consuming for large datasets and may require expert tuning to determine convergence. It is unclear how well MCMC can be applied to inference on the genome-wide scale with current commodity computing hardware.

3.4.4. Other Methods for Dimensionality Reduction

An alternative approach to feature selection is to perform dimensionality reduction on the data, such that the data is "compressed" from its original high dimensions to lower dimensions, making it more amenable to analysis using standard classification and regression tools. One common approach is Principal Component Analysis (PCA) (see, for example, Hastie et al. (2009a)), which is based on the singular value decomposition of the data X ∈ R^{N×p}

X = U D V^T,  (3.27)

where U ∈ R^{N×k} and V ∈ R^{p×k} are orthogonal matrices whose columns are the eigenvectors of XX^T and X^T X, respectively, D ∈ R^{k×k} is a diagonal matrix consisting of the square roots of the eigenvalues of X^T X and XX^T (the same eigenvalues in both cases), and k = min{N, p}. To obtain a dimensionality reduction, we project the data onto the eigenvectors

X′ = X V.  (3.28)

We can select as many columns of X′ as required: the first k columns of X′ (the principal components) are the best rank-k approximation of the original data X, and this submatrix can be used as input for any classification or regression method in place of the original data. Each principal component explains some proportion of the variance in X, such that progressively using more principal components explains more variation in X.
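A short numpy sketch of PCA via the SVD of Eqn 3.27 and the projection of Eqn 3.28; column-centring of X is assumed as a preprocessing step, and the matrix dimensions and number of retained components are illustrative only.

import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 500))                     # illustrative N x p data matrix

Xc = X - X.mean(axis=0)                             # centre the columns (assumed preprocessing)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # X = U D V^T as in Eqn 3.27
X_proj = Xc @ Vt.T[:, :10]                          # first 10 principal components (Eqn 3.28)
explained = d ** 2 / np.sum(d ** 2)                 # proportion of variance per component
print(X_proj.shape, np.round(explained[:5], 3))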


While PCA is convenient for reducing the dimensionality of large datasets, it has several disadvantages. First, the number of principal components (PCs) to use must be determined somehow. A common way to decide is to examine the proportion of variance explained by each subsequent principal component, and cut the number of PCs when a substantial amount of variance has been explained, or when the increase in explained variance has plateaued (the "knee of the curve"). Second, PCA is an unsupervised method, and while it is useful for explaining variation in X, this variation may not necessarily be useful for predicting some output y. Semi-supervised PCA has been proposed to take into account only the useful variation (Bair and Tibshirani, 2004). Third, interpretation of the PCs is less intuitive than that of the original inputs, since the PCs are linear combinations of the original variables.

3.4.5. Other Methods for Classification and Regression

Two other common approaches to predictive modelling include random forests and boosting. We survey them briefly.

Random Forests

There are two essential components to the Random Forests (RF) method (Breiman, 2001): trees and bootstrap aggregation (bagging; Breiman, 1996). The method relies on generating a large number of trees, each on a slightly different version of the data (achieved by resampling the data with replacement), and then averaging over these trees (bagging), in order to reduce the variance and achieve a single predictor with low bias and lower variance (Hastie et al., 2009a). A basic algorithm for inducing an RF classifier is given in Hastie et al. (2009a), assuming a user-chosen number of trees B (a usage sketch follows below):

1. For b = 1 to B:
   • Draw a bootstrap sample (with replacement) from the data, of the same size as the original data.
   • Induce a tree T_b using the sampled data, by repeating the following process for each node in the tree until a minimum node size is achieved, or, in classification, until each node is "pure" (contains only one class):
     – Select m_try ≤ p variables at random out of the p variables.
     – Out of these m_try variables, pick the best variable in terms of splitting the data, based on some measure such as the classification error rate.
     – Split the node into two child nodes based on the selected variable.
2. Return the set of trees {T_1, ..., T_B}.

Using a subset of variables m_try < p leads to lower correlation between the trees, and consequently to less overall variance in the bagging step, and hence to better bagged models than using all p variables at each split.
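The usage sketch referred to above, using scikit-learn's RandomForestClassifier as a stand-in for the algorithm just described; the simulated data, number of trees, and m_try setting are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(size=300) > 1).astype(int)

rf = RandomForestClassifier(n_estimators=500,     # B trees
                            max_features="sqrt",  # m_try variables considered at each split
                            oob_score=True,       # out-of-bag performance estimate (see below)
                            random_state=0).fit(X, y)
print("OOB accuracy:", round(rf.oob_score_, 3))
pred = rf.predict(X[:5])                          # majority vote over the B trees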


An estimate of the expected generalisation predictive performance of the RF model is given by the out-of-bag (OOB) estimate: at each step, we can estimate the error for the model trained on the bootstrap sample by using it to predict the data that was not selected, a process similar to cross-validation.

For classification of a new sample, the data is fed to each tree T_b, and the final prediction is a simple majority vote over the B predictions for each sample.

Boosting and Gradient Boosting Machines

As in Random Forests, the boosting approach (Freund and Schapire, 1997) relies on multiple classifiers (called learners) which are combined to produce an ensemble predictor. However, there are substantial differences between the two approaches. In contrast with RF, where the base classifiers are large trees (leading to low bias but high variance), in boosting the base classifiers are weak, capable of predicting the outcome with probability only slightly higher than chance. Roughly speaking, the basic boosting approach, exemplified in the original algorithm AdaBoost (Freund and Schapire, 1997), iteratively adds these weak classifiers to the model, while upweighting the importance of samples that were incorrectly classified in the previous round. When the algorithm terminates (for example, after a fixed number of iterations), the final model is a weighted combination of the individual classifiers, each weighted by the error it induced on the data in its respective iteration.

Later, Friedman et al. (2000) described the deep links between boosting and optimisation of an exponential loss function, opening the way to expanding the boosting approach to other loss functions such as the logistic loss (logitboost), linear loss, and others. A regression tree is used as the learner in each iteration, and the next tree is regressed on the residuals from the previous iteration. A modern approach to boosting is the Gradient Boosting Machine (GBM) (Friedman, 2001; Ridgeway, 1999), which is a fast and flexible method, allowing for the use of multiple loss functions and exploration of interaction terms (ANOVA decomposition), which may be useful for considering epistasis between SNPs. GBM-like approaches have been proposed for analysis of gene expression data (Luan and Li, 2008; Wei and Li, 2007) and more recently for SNP data (Cosgun et al., 2011; Ogutu et al., 2011).

3.4.6. Discussion

We have presented some of the main approaches for supervised learning and feature selection, with an emphasis on penalised methods, particularly the lasso. The lasso approach assumes that the inputs are truly sparse, in contrast to methods such as ridge regression where each input variable gets some non-zero weight. Hastie et al. (2009a) discuss what they call "betting on sparsity": in high dimensional data (N ≪ p), which is commonly the case in gene expression and genetic data, if the data are truly sparse in the input variables then lasso-type approaches will tend to do better than non-sparse methods, whereas if the data are truly non-sparse then both types of methods will tend to do badly.
Therefore, they advocate using sparse methods for high dimensional data. An important practical advantage of sparsity-based methods like


the lasso is that assuming sparsity simplifies computations greatly. Lasso penalised models can be fit efficiently using coordinate descent, enabling analysis of large SNP datasets consisting of thousands of samples and hundreds of thousands of SNPs, as discussed in Chapter 6.

The standard lasso method has some drawbacks. First, the lasso tends to arbitrarily select one variable out of a set of highly correlated variables, in contrast with the ℓ₂ penalty, which will assign all such variables a non-zero weight, albeit with potentially different signs. Thus the interpretation of the weights differs between the two methods — with the lasso, a zero weight may be given to a variable that is associated with the outcome but is also highly correlated with a variable already in the model. This does not imply that the first variable is not important, only that it is redundant in explaining the output (it does not further minimise the loss subject to the lasso constraints). This is important when comparing marker lists generated from different datasets, since there might seem to be little overlap between the selected markers, when in fact they are conveying the same information. Second, the lasso can also break down, in the sense of including too many non-relevant variables in the model, when the truly associated variables are highly correlated with the spuriously-associated variables. Finally, the lasso penalty is not directly interpretable, and is usually set through cross-validation. Bayesian approaches allow for more principled ways of determining priors, such as incorporating prior biological knowledge (Kim and Xing, 2011). However, the computational cost of Bayesian methods is much higher in comparison — Kim and Xing (2011) report fitting models to 100,000 SNPs in one day, whereas lasso models can be fit to such data in minutes or seconds on commodity hardware. Ultimately, all statistical models are simplifications of complex biological reality. The assumptions behind our models can and should be checked; however, models can still be useful and biologically informative even if not completely correct.

3.5. Feature Selection and Multivariable Models of Genetic Data

As discussed in Chapter 2, genetic research has largely evolved independently of research in gene expression, with different analytical tools used in each. This is both for historical reasons, as the field of genetics predates modern gene expression experiments, and because of the unique characteristics of the data, such as the discreteness of the genotype, the fact that genotypes are inherited and thus show similarities between siblings, and practical considerations owing to the much larger number of SNPs that are typically assayed, usually an order of magnitude more than the genes assayed in gene expression experiments. Therefore, we now discuss the ways in which feature selection and statistical models have been applied in genetics, focusing on detection of SNPs associated with human disease in GWAS.
We do not discuss pedigree-based methods.

Whether used for investigation of binary or quantitative phenotypes, statistical analysis of SNP data is conventionally done on a univariable (per-SNP) basis, where each SNP is


individually tested for association with a phenotype. This approach is statistically well-studied, and many tests fall within this category, such as the allelic and genotypic χ² tests, the Cochran-Armitage trend test, and univariable logistic regression (see Section 3.4.1). Univariable statistics have been widely applied for detecting many variants associated with a wide range of human diseases and other phenotypes.

The univariable approach has several shortcomings. First, a multiple testing correction must be applied due to the large number of hypothesis tests performed, in order to control the type I error rate. If we use the stringent Bonferroni correction, the multitude of tests translates into very strict p-value cutoffs, which may exclude some predictive SNPs. Second, most univariable analyses do not account for LD between SNPs, leading to selection of highly correlated SNPs. While all such correlated SNPs can potentially be biologically informative, many of them may be redundant for prediction of the phenotype, since the marginal information provided by each of them is small. In this case, it may be better to select SNPs with weaker marginal association (larger p-value), but that are less correlated with the other SNPs already selected. Third, merely detecting a set of SNPs is not in itself sufficient for the purposes of predictive modelling: all detected SNPs must be merged into one model, for example, by fitting a logistic regression to the SNPs with p-values below the cutoff. Any informative SNPs that did not pass the cutoff will not be able to contribute to this predictive model, and predictive ability may consequently be reduced.

An alternative modelling approach to univariable testing is multivariable modelling of SNPs — predictive models that take into account all available data concurrently. Specifically, lasso penalised models are an attractive class of multivariable models that address the issues identified above. As discussed in Section 3.4.3, the lasso model fit is penalised with a tunable penalty parameter. Instead of selecting SNPs by p-value, they are included in or excluded from the model based on how much they contribute to the model fit, balanced by the magnitude of their effect (with the balance determined by the penalty). Thus, the lasso approach potentially considers all SNPs in the model, with some of them receiving zero weights (becoming excluded) depending on the penalty; the penalty is tuned by cross-validation. Therefore, lasso models need not exclude SNPs that do not achieve genome-wide significance, and all SNPs are candidates for inclusion in the model. Further, being a multivariable model — specifically, a linear model — these lasso models account for the correlation between SNPs, in that out of a group of highly correlated SNPs only one may get selected.
In other words, the selected SNPs are non-redundant in terms of contribution to the predictive ability of the model. Finally, the same model can be used for prediction of the phenotype from genotype.

The usefulness of lasso multivariable models for modelling SNP effects (Ayers and Cordell, 2010) has inspired several methods, including lasso penalised logistic regression (Wu et al., 2009), the adaptive lasso (Yang et al., 2010), "Bayesian-inspired" logistic regression (HyperLasso) with two sparsity-inducing priors (the double exponential, which is identical to the lasso, and


the normal-exponential gamma prior, NEG) (Eleftherohorinou et al., 2009; Hoggart et al., 2008), and hierarchical Bayesian linear regression with priors based on existing knowledge of LD structure (Kim and Xing, 2011). Another (non-sparse) Bayesian approach was presented by Logsdon et al. (2010). These studies and others have largely focused on assessing how well causal SNPs can be detected in simulated data, or how well already-characterised SNPs can be found in real data. These studies have shown that multivariable models, especially sparse ones such as the lasso and related priors such as the NEG, are better able to detect causal variants than univariable (SNP-at-a-time) statistics. Relatively few studies have analysed the predictive ability of such models across a wide spectrum of human complex disease. Several notable exceptions include Kooperberg et al. (2010), who applied lasso logistic regression to several datasets (Crohn's disease, type-1 and type-2 diabetes); Wei et al. (2009), who examined logistic regression and support vector machine models of the same data, but achieved better results in terms of predictive ability, especially for type 1 diabetes; and Eleftherohorinou et al. (2009), who used HyperLasso models, based on pre-identified SNPs belonging to known pathways, to model disease risk in Crohn's disease, rheumatoid arthritis, and type 1 diabetes. These studies have shown the utility of multivariable models in risk prediction, but they remain the minority approach to analysing SNP data, and the per-SNP univariable approach still dominates the literature. We employ sparse models throughout this thesis: in Chapters 5 and 6 we explore the use of lasso multivariable linear models for analysis of case-control SNP datasets, in Chapter 7 for analysis of gene expression, metabolites, and SNPs, and in Chapter 8 we use a sparse method that is an extension of the lasso, applied to the setting of multiple correlated phenotypes.


4. Prediction of Breast Cancer Prognosis using Gene Set Statistics

Gene expression microarrays measure mRNA concentration in cells, as a proxy for measuring gene activity. The advent of relatively cheap gene expression microarray technology, especially commercial oligonucleotide arrays, has made it possible to assay tissue with various phenotypes under a multitude of conditions. Due to high interest in the molecular basis of human diseases, there have been many expression experiments that explore human cell lines and human tissue originating from samples with different conditions, especially those related to disease such as cancer (for example, Golub et al., 1999; Ramaswamy et al., 2001; van 't Veer et al., 2002, and many others; Sotiriou and Pusztai, 2009). The ultimate aim of such studies is two-fold: to better understand the mechanisms underlying disease (etiology), and to define better markers of disease for early detection, diagnosis, and prognosis.

In one of the first such studies, van 't Veer et al. (2002) considered gene expression data from breast tissue coming from breast cancer patients, with the goal of predicting whether distant metastasis (metastasis into other organs outside the breast) would occur within five years. Subsequently, many studies have produced predictive gene lists for different diseases. However, gene lists produced from similar datasets, or even lists produced from slightly different versions of the same data, often showed little overlap, raising doubts about the validity of these lists. Our approach to this issue is to use pre-existing knowledge, in the form of groups of genes (gene sets), to form aggregate features that are then used for classification. By aggregating over gene sets, the resulting features are less affected by noise or experimental variability.


In this chapter we show that the use of gene sets produces feature lists that are more stable and reproducible, while maintaining the same predictive ability as the individual genes, and that these gene sets correspond to biologically plausible mechanisms of cancer metastasis, such as the cell cycle.

4.1. Introduction

Breast cancer is one of the most common cancers in the Western world, and the most common cancer among Western women (Weigelt et al., 2005). Much attention has been devoted to understanding the biological mechanisms of breast cancer, towards several goals. The first goal is early diagnosis, before major symptoms appear. A second goal is prognostication — prediction of prognosis based on current data, mainly the recurrence of distant metastasis, which is the main cause of death. A third goal is increased biological insight, to allow development of more effective treatments.

Here we concentrate on the issue of prognostication, more precisely, predicting distant metastasis in breast cancer patients for up to 5 years into the future. The cutoff point of 5 years is indeed an arbitrary one, and binarising a continuous variable leads to some loss of information — for example, of two patients with relapses at 4.9 and 5.1 years respectively, the first is considered to have a poor outcome but the second is considered to have a positive outcome. Continuous time-to-event data can be and has been analysed using survival models. However, we use the binarised form of the outcome since we wish to compare our results with previous studies that have done the same, and converting the problem into a binary classification task is convenient since there are many tools for binary classification.

Those patients that are predicted to relapse within 5 years are considered to be a high-risk subgroup. The goal is to identify these patients based on gene expression data, so that they can be treated more aggressively with adjuvant chemotherapy. One of the basic analyses of gene expression is searching for differentially expressed (DE) genes between two conditions. One of the simplest such analyses is the heatmap, which utilises hierarchical clustering of the samples and the genes to produce a visual display of patterns of differential gene expression across two phenotypes. Figure 4.1 shows a heatmap of the top 50 differentially expressed genes in a subset of 250 samples from five breast-cancer datasets, where hierarchical clustering was used to cluster both the samples and the genes. Heatmaps are useful for visualising gross features of the data, such as whether there are any substantial differences in gene expression between groups of samples.
In this example, roughly three major groups of genes with similar expression profiles are visible (group 1 with probesets 204475_at through 206023_at, group 2 with 215176_x_at through 217157_x_at, and group 3 with 220177_s_at through 216474_x_at). Considering the phenotypic status shown for each sample, we can see some differential expression between the two classes: the left hand side of the plot is mostly controls (no-relapse), with low gene expression for group 1 and high gene expression for groups 2 and 3.


In contrast, on the right hand side there are far more case (relapse) samples; the genes in group 1 show low expression overall, and group 3, and to some extent group 2, show higher expression. While these patterns are highly suggestive of associations between the gene expression levels and the phenotypes, this information is not in itself enough to quantify the associations so that we may predict the phenotype of a new sample, and hence it cannot be used for prognosis prediction.

Two of the first studies to propose a prognostic gene list for predicting breast cancer distant metastasis were by van 't Veer et al. (2002) and, subsequently, van de Vijver et al. (2002), where a classifier was trained to predict metastatic class from an annotated dataset (supervised classification). The van 't Veer classifier is based on correlation between gene expression and the class label. First, they selected about 5,000 significantly differentially expressed genes of the 25,000 genes on the array, over 78 samples. Out of these, they then selected 271 genes that had absolute correlation of 0.3 or higher with the disease status. Of the 271 genes, they selected genes a second time, by starting with an empty list and adding 5 genes at a time. The optimal size of the list was evaluated using leave-one-out cross-validation. Finally, they arrived at a prognostic list of 70 genes (termed NKI70). They reported a classification accuracy of 83% on the training set. This prognostic gene list was then validated on another, independent dataset consisting of 19 samples, with similar results. Other prognostic lists were later compared by Fan et al. (2006) and Haibe-Kains et al. (2008), and showed high concordance in terms of classifying patients into the same risk categories. Later, the 70-gene list was renamed MammaPrint (Wittner et al., 2008), and is currently being commercialised for diagnostic purposes.

Concurrently, Michiels et al. (2005) evaluated seven studies of breast cancer, reporting that in all but one, predictive ability was not significantly better than random, and that different studies produced different lists of prognostic genes. They used random splitting of the data into training and testing sets in order to estimate the predictive ability of these models. Similarly, Ein-Dor et al. (2005) again analysed the van 't Veer data, and again showed that different lists produced from random perturbations of the same data were highly disjoint. Later, they estimated that thousands of samples, rather than the hundreds routinely available in microarray studies, would be required in order to achieve stability of the gene lists (Ein-Dor et al., 2006).

These results raise several questions. First, are the genes identified by such studies truly associated with cancer and metastasis, or are they spurious, the results of complex models overfitting the noisy data? Second, if the genes are associated with cancer, are they also causally related to it?
A gene may be downstream of a cancer-causing gene and therefore be associated with cancer but not cause cancer. Third, can a stable gene list be found at all? Fourth, do the different lists actually represent the same underlying pathways, and hence might they be more in agreement when interpreted in the wider biological context than is otherwise apparent?


[Figure 4.1: heatmap of the top 50 differentially expressed probesets (rows) over 250 samples (columns); colour key: row Z-score from −4 to 4. Probeset and sample labels are not reproduced here.]
Figure 4.1.: A heatmap showing differentially expressed genes (rows) over a subset of 250 samples (columns) from the five breast cancer datasets. Differential expression was determined using linear models in limma (Smyth, 2005). Samples are coloured red and blue for < 5 years and ≥ 5 years to metastasis, respectively. Under- and over-expressed genes are coloured red and green, respectively.


If the different predictive genes truly represent the same underlying biology, then perhaps what is needed is to evaluate genes as members of gene pathways, where we loosely define a pathway as a set of interacting genes, and to use the pathway information to guide the selection of predictive genes. Ideally, one would like to have detailed gene pathway information, which could then be used to select genes with a potential causal link to cancer and metastasis. This has largely not been possible due to limitations on data size (too few microarray samples available) and the complexity of gene-gene interactions. Therefore, the problem of finding the pathway information must be tackled in other ways. One way is to assume that genes with correlated expression belong together in one pathway (or are somehow otherwise related to each other even if they do not interact directly), and to find the sets de novo in the data, using methods such as searching over a space of models representing regulation programs (Segal et al., 2003) or k-means clustering (Yousef et al., 2007). Similarly, van Vliet et al. (2007) used an unsupervised module discovery method to find gene modules, calculated a discrete module activity score, and used the score as a feature for a naive Bayes classifier. They reported that classifiers based on gene sets were slightly better predictors of breast cancer outcome than those based on individual genes. Chuang et al. (2007) used a mutual information scoring approach to analyse known protein-protein interaction (PPI) networks, infer gene pathways, and find subnetworks predictive of breast cancer metastasis.

The other main approach to leveraging pathway structure has been to use external pathway information, for example from interactions defined in the literature. Svensson et al. (2006) analysed expression data from ovarian cancers based on gene sets from the Gene Ontology (GO) (Ashburner et al., 2000); to represent each set's expression they used a statistic that is essentially a majority-vote of the over- and under-expressed genes (whether the set is over- or under-expressed on average). In a large study of 12 breast cancer datasets, Kim and Kim (2008) reported a classification accuracy of 0.676 over 6 additional datasets, using 2411 gene sets from GO categories, pathway data, and other sources. The balanced accuracy was 0.64¹, averaged over all dataset pairs (one dataset in the pair used for training and the other for testing). Additionally, Kim and Kim (2008) reported low overlap between the top gene sets identified, in terms of their common genes. Lee et al. (2008) used the Molecular Signatures Database (MSigDB) C2 gene sets (Subramanian et al., 2005), which are lists of manually curated genes found to be associated with cancer in the literature and in large-scale data mining experiments; they selected gene sets using the t-test on their constituent genes, and used the sets as features for classification in several cancer datasets, including breast cancer.
They1 The balanced accuracy is BACC = (sens + spec)/2 where sens and spec are the sensitivity and specificity,respectively, and accounts <strong>for</strong> uneven proportions <strong>of</strong> the two classes in the data, unlike the standardaccuracy.61


They did not, however, examine whether features derived from gene sets are any more stable than those based on individual genes, a question which is the main focus of our work.

Once a tentative or known gene pathway has been identified, the next issue is how to use the expression levels of its constituent genes in a meaningful way. Some options are to use the mean or median expression (Guo et al., 2005), the first few principal components (Bild et al., 2006), and the z-statistic (Törönen et al., 2009). Below we examine several approaches, which we call set statistics.

In this work we propose using prior knowledge, in the form of pre-specified lists of genes (gene sets) based on the Molecular Signatures Database (MSigDB) (Subramanian et al., 2005), in order to form new features from individual genes. A gene set is a group of genes that have been selected due to shared functionality, membership in the same biological pathway, or empirical relatedness (coexpression). Moving away from considering genes in isolation, these features serve as proxies for measuring the activity of the set as a whole. There are many approaches to gene set enrichment (Ackermann and Strimmer, 2009; Subramanian et al., 2005); however, it is not clear whether these enrichment measures imply good predictive ability as well. Using five breast cancer datasets, we compare features derived from gene sets with features based on individual genes, with respect to the following criteria:
• Discrimination: ability to predict metastasis within 5 years, both on average and in terms of its variance;
• Stability of the ranks of individual features within datasets;
• Concordance between the weights and ranks of features from different datasets; and
• Underlying biological processes indicated by the gene sets.

4.2. Methods

We now describe the breast cancer datasets used in this work, the gene set statistics approach, and the framework for determining which gene sets are associated with the phenotypic outcome.

4.2.1. Data

We used five previously published breast cancer datasets from NCBI GEO (Edgar et al., 2002): GSE2034 (Wang et al., 2005), GSE4922 (we used the untreated subset of the Singapore cohort) (Ivshina et al., 2006), GSE6532 (Loi et al., 2007, 2008) (untreated cohort), GSE7390 (Desmedt et al., 2007), and GSE11121 (Schmidt et al., 2008) (Mainz cohort). All five were assayed on the Affymetrix HG-U133A microarray platform (some of the datasets included other microarray platforms, which were removed).


We removed quality control probesets and probesets with close to zero variance across the samples; in total, each microarray had 22,215 remaining probesets. We normalised GSE2034 and GSE6532 using quantile normalisation as implemented in RMA (Bolstad et al., 2003). For the remaining datasets, raw data were not available and we used the data as normalised by their respective authors in the original publications. All data were converted to the log2 scale, as gene expression data are typically better approximated by the normal distribution on this scale (Wang and Speed, 2003). Missingness was very low, with only GSE6532 having 12 missing values; therefore we used simple median imputation for each gene instead of more sophisticated imputation approaches (Bø et al., 2004; Kim et al., 2005).

The data contain a majority of lymph-node-negative and some node-positive breast cancer samples. For GSE7390, GSE11121, and GSE2034, none of the patients received adjuvant treatment. For GSE6532 and GSE4922, some patients received adjuvant treatment; these were removed from the data. The data contain patients with both ER-positive and ER-negative tumours. Patients were classified into two groups, low and high risk, according to the time to distant metastasis, using a cutoff point of 5 years. Patients censored before the cutoff were considered non-informative and were removed. The final number of samples for each dataset is shown in Table 4.1.

4.2.2. Discrimination

We measure the discrimination of a classifier using the Area Under the ROC Curve (AUC or AROC) (Hanley and McNeil, 1982), defined as

\widehat{\mathrm{AUC}} = \frac{1}{N_+ N_-} \sum_{i=1}^{N_+} \sum_{j=1}^{N_-} \left[ I(\hat{y}_i > \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right], \qquad (4.1)

where N_+ + N_- = N are the numbers of positive and negative labels, respectively; \hat{y}_i is the prediction for the ith sample, and I(\cdot) is the indicator function, I(x) = 1 when x is true and 0 otherwise. The sample AUC has a probabilistic interpretation as the (estimated) probability of correctly ranking two randomly chosen samples in the correct order (that is, short-term survival before long-term survival), plus a correction for ties. AUC = 0.5 is equivalent to random ranking, whereas AUC = 1 and AUC = 0 correspond to perfect and perfectly-wrong ranking, respectively. Unlike the error rate (or, conversely, the accuracy), the AUC does not depend on the class balance of the dataset, hence it can be meaningfully compared across different datasets.
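To make Eqn. 4.1 concrete, the following minimal R sketch computes the sample AUC from a vector of class labels and a vector of continuous predictions; the function and variable names are illustrative only and are not taken from the software used in this chapter.

    # Estimate the AUC of Eqn. 4.1: average over all (positive, negative) pairs,
    # counting correctly ordered pairs and giving half credit to ties.
    auc <- function(y, yhat) {
      pos <- yhat[y == +1]   # predictions for the N+ positive samples
      neg <- yhat[y == -1]   # predictions for the N- negative samples
      mean(outer(pos, neg, function(a, b) (a > b) + 0.5 * (a == b)))
    }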


[Table 4.1: columns Dataset, Samples (original and removed), Lymph, Therapy, Age (years), Grade, and ER status, for GSE2034, GSE4922, GSE6532, GSE7390, and GSE11121; the individual values are not reproduced here.]


4.2.3. Classifiers

Feature instability, manifested as discordant gene lists, can be caused by inherent instability (genes that truly have high variance between different samples), by overfitting of the classifier to the data, and by redundancies in the data, such as perfect correlation between features. This is especially the case in gene expression data, where the number of features p far outweighs the number of samples N, and the underlying measurements are known to be noisy. In such scenarios, there is always the possibility of overfitting: the situation where a model performs well or even perfectly on the training data, but performs worse, or even no better than random, on independent testing data. Therefore, to reduce the risk of overfitting, we use the centroid classifier (Schölkopf and Smola, 2002). The centroid classifier is equivalent to a heavily-regularised support-vector machine (Bedo et al., 2006) and to Fisher Linear Discriminant Analysis (LDA) with diagonal covariance and uniform priors (Dabney and Storey, 2007; Tibshirani et al., 2003). The centroid classifier implements a model with strong assumptions about the data: it does not account for the variance of each gene, or for the fact that genes are correlated, and the weight estimated for each gene is independent of the weights estimated for all other genes, unlike, for example, in logistic regression or an SVM.

In practical terms, we expect the centroid classifier to be less prone to overfitting than an SVM or similar classifier. We further stabilise the centroid's estimates by averaging them over random subsamples of the data. Despite its simplicity, the centroid classifier performs well in microarray studies (Bedo et al., 2006), where commonly the number of features is much greater than the number of samples (p ≫ N) and there is significant noise. For the centroid classifier, we observed discrimination similar to or better than several other classifiers, including SVMs, nearest shrunken centroids (Tibshirani et al., 2003), and the van 't Veer classifier.

The centroid classifier finds the centroid of each class across the p features, that is, the p-vector of average gene expression in each class. New observations are classified by comparing their expression vector with the two centroids and choosing the closest centroid. Given a p × N matrix Z = [z_{ji}], 1 ≤ j ≤ p, 1 ≤ i ≤ N, the p-vector centroids of the positive and negative classes are, respectively,

c_+ = \frac{1}{N_+} \sum_{\{i \mid y_i = +1\}} z_i, \qquad c_- = \frac{1}{N_-} \sum_{\{i \mid y_i = -1\}} z_i,

where N_+ + N_- = N are the numbers of samples in the positive and negative classes, respectively, and z_i is the ith expression vector of p features (the ith column of Z, one sample). The centroid classifier predicts using the inner product rule

\hat{y}_i = \langle z_i - c, w \rangle,

where \langle x, y \rangle = \sum_{j=1}^{p} x_j y_j is the inner (dot) product, c = (c_+ + c_-)/2 is the point midway between the centroids, and the feature weights w are the p-vector connecting the two centroids,

w = c_+ - c_-. \qquad (4.2)

The sign of \hat{y}_i is then the predicted class.
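A minimal R sketch of the centroid classifier defined above; Z is a p × N training matrix, y a vector of labels in {−1, +1}, and the function names are ours, chosen for illustration rather than taken from the thesis code.

    centroid_train <- function(Z, y) {
      c_pos <- rowMeans(Z[, y == +1, drop = FALSE])   # centroid of the positive class
      c_neg <- rowMeans(Z[, y == -1, drop = FALSE])   # centroid of the negative class
      list(w = c_pos - c_neg,                         # feature weights, Eqn. 4.2
           c = (c_pos + c_neg) / 2)                   # midpoint between the centroids
    }

    centroid_predict <- function(fit, Znew) {
      # inner-product rule: yhat_i = <z_i - c, w>; the sign gives the predicted class
      drop(crossprod(Znew - fit$c, fit$w))
    }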


For the calculation of the area under the receiver operating characteristic curve (AUC), we use \hat{y}_i itself as the prediction, since it produces AUC estimates with lower variance than does the binary class prediction sign(\hat{y}_i): ties in the ROC calculation are more likely for discrete predictions in {−1, +1} than for continuous predictions, which manifests as jagged ROC curves and, equivalently, as AUC estimates with higher variance.

Note that the centroid classifier used here is similar but not identical to the classifier used by van 't Veer et al. (2002); they assigned each sample to the class whose centroid had the highest Pearson correlation with the sample. This is equivalent to our version of the centroid classifier when the samples are scaled to unit norm (McLachlan et al., 2004).

The centroid classifier requires no tuning since it has no hyperparameters, making it fast to compute. Recursive feature elimination (RFE) (Guyon et al., 2002) can be used to train classifiers with different numbers of features, in order to potentially find parsimonious models with the best predictive features. In RFE, a full model (all features) is first trained on the data. Then one or more features are dropped from the model, based on the absolute value of their weights (smallest weights first), and the model is re-trained using the reduced set of features. The process continues until there are no features left in the model. For the centroid classifier, RFE is especially simple since features can simply be eliminated in reverse order of their absolute weights, and the model does not need to be re-trained each time, because the weights of the features are independent of each other.
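Because the centroid weights do not change when features are removed, RFE for the centroid classifier amounts to a single sort of the features by absolute weight. A hedged sketch, reusing the illustrative centroid_train function above:

    # Indices of the top k features by absolute centroid weight; for this classifier,
    # equivalent to running RFE down to k features, since the remaining weights never change.
    rfe_centroid <- function(Z, y, k) {
      fit <- centroid_train(Z, y)
      order(abs(fit$w), decreasing = TRUE)[seq_len(k)]
    }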
4.2.4. Internal versus External Validation

Since we have five datasets, it might seem reasonable to combine them into one. However, we were interested in measuring the concordance between datasets, rather than performing a meta-analysis. The inter-dataset analysis emulates the real-world situation where different studies are performed separately rather than pooled together. Therefore, we distinguish between internal and external validation. In the former, we estimate the classifier's generalisation within each dataset, using repeated random subsampling; the subsampling is then used to form a bagged classifier for each dataset (described in Section 4.2.1). Bagging refers to bootstrap aggregation (Hastie et al., 2009a), a procedure that involves training multiple separate classifiers on random samples of the data and building a final classifier by averaging the individual classifiers (either their model weights or their predictions). The random samples can be chosen with replacement, as in the bootstrap procedure (Efron and Tibshirani, 1993), or without replacement, as in cross-validation. Bagging reduces the variance of the predictions without increasing the bias (Hastie et al., 2009a).

We then perform external validation, where the bagged classifier from each dataset is used to predict the metastatic class of patients from another dataset. This is a more realistic estimate of the classifier's discriminative ability.

For internal validation, we used repeated random subsampling to estimate the classifier's internal generalisation error, as measured by AUC. In this approach, the dataset is randomly split B times into training and testing parts (2/3 and 1/3 of the data, respectively). We used B = 25 splits. Repeated subsampling with a 2/3–1/3 split is similar to the 0.632 bootstrap without replacement (Binder and Schumacher, 2008). Each split results in one model; the predictions from the B models are then combined into one bagged prediction by averaging over the B predictions and using that vector of averages as the final prediction.
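A sketch of the repeated-subsampling scheme just described, again assuming the illustrative centroid_train and centroid_predict functions from Section 4.2.3; in this simplified version the bagged prediction for each sample is the average over the splits in which that sample was held out.

    # Bagged internal validation: B random 2/3-1/3 splits of the N samples.
    bagged_centroid <- function(Z, y, B = 25) {
      N <- ncol(Z)
      preds <- matrix(NA_real_, nrow = N, ncol = B)
      for (b in seq_len(B)) {
        train <- sample(N, size = round(2 * N / 3))       # 2/3 training, without replacement
        fit <- centroid_train(Z[, train, drop = FALSE], y[train])
        preds[-train, b] <- centroid_predict(fit, Z[, -train, drop = FALSE])
      }
      rowMeans(preds, na.rm = TRUE)   # average over the splits in which each sample was held out
    }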


4.2.5. Molecular Signatures Database Gene Sets

We used five gene set collections, totalling 5452 gene sets, from the Molecular Signatures Database (MSigDB) v2.5 (http://broadinstitute.org/gsea/msigdb):
• C1: 386 positional gene sets, defined for each human chromosome and cytogenetic band that has at least one gene. These sets represent expression effects associated with chromosomal amplifications and deletions, dosage compensation, and epigenetic silencing.
• C2: 1892 curated gene sets, collected from annotated sources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000; Kanehisa et al., 2010), PubMed publications, and several pathway databases such as BioCarta (http://www.biocarta.com) and Reactome (http://www.reactome.org) (Matthews et al., 2009).
• C3: 837 motif gene sets, which are gene sets that share cis-regulatory motifs conserved across humans, mice, rats, and dogs (Xie et al., 2005).
• C4: 883 computational gene sets, derived from data mining of large collections of cancer-related gene expression data (Brentani et al., 2003; Segal et al., 2003).
• C5: 1454 Gene Ontology gene sets, derived from the Gene Ontology database (Ashburner et al., 2000). Note that GO gene sets are not necessarily co-regulated genes.
Within and between these collections, the gene sets may overlap.

4.2.6. Gene Set Statistics

The purpose of a set statistic is to reduce the set's expression matrix to a single vector, which is then used as a feature for classification. The intention is for the set statistic to represent the expression levels of the set in a useful way. Here we describe the different set statistics used. All of our set statistics are unsupervised, in the sense that they do not take the metastatic class into account, unlike, for example, the t-test set statistic (Tian et al., 2005), GSEA (Subramanian et al., 2005), or GSA (Efron and Tibshirani, 2007). Any standard classifier, such as a support vector machine (SVM), can be employed by using these set statistics as features. The gene set statistics used here are summarised in Table 4.2.


[Figure 4.2: schematic of X, the p × N matrix of gene expression, with three gene sets S_1, S_2, S_3 marked on its rows, reduced to Z, the M × N matrix of gene set statistics.]
Figure 4.2.: Schematic of how the gene set features are constructed from three gene sets S_1 (red), S_2 (green), and S_3 (blue), with 2, 3, and 1 gene(s), respectively. Note that for clarity we show non-overlapping sets, although the sets can overlap in practice.

Notation

Here, X = [x_{ki}], 1 ≤ k ≤ p, 1 ≤ i ≤ N, is the p × N matrix of gene expression levels, where N is the number of samples and p is the number of genes. The ith column (a p-vector) of X is denoted x_i. Every gene belongs to one or more gene sets S_j, such that S_j ⊂ {1, ..., p}, for j = 1, ..., M, where M is the number of gene sets. The cardinality of the jth set (the number of genes in the set) is denoted s_j = |S_j|. We use X_{S_j} to denote the s_j × N submatrix of X that corresponds to the jth gene set. The construction of the resulting M × N matrix of gene set statistics Z = [z_{ij}] is shown in Figure 4.2.

Set Centroid and Set Median

The centroid of a set S_j is the mean expression level over all genes in the set. The matrix of all centroids is an M × N matrix whose rows (all samples for one gene set) are

c_j = \frac{1}{|S_j|} \sum_{k \in S_j} x_k \in \mathbb{R}^N, \qquad j = 1, \ldots, M, \qquad (4.3)

where x_k = (x_{k1}, ..., x_{kN})^T is the expression vector of the kth gene. Similarly, the set median is the median expression level over all genes in the set for a given sample.

The motivation for the set centroid is that it reduces the variance in each feature, since the sample variance of the mean of n samples of a random variable X is the square of the standard error of the sample mean \bar{x}.


The actual decrease in variance depends on the degree of correlation between the variables. Another interpretation is that all the genes in the same set are shrunk towards the mean, thereby reducing the effect of outlier genes and reducing potential overfitting.

The set median for the jth set is m_j = (m_{j1}, ..., m_{jN})^T, defined as

m_{ji} = \begin{cases} R_{j,(s_j+1)/2} & \text{for odd } s_j, \\ \left(R_{j,s_j/2} + R_{j,s_j/2+1}\right)/2 & \text{for even } s_j, \end{cases} \qquad (4.4)

where R_j = (R_{j1}, ..., R_{j s_j})^T is the sorted vector of gene expression values for the jth set in the ith sample, and s_j = |S_j| is the number of genes in the jth set. The set median is less sensitive to outliers than the set centroid.

Set Medoid

The medoid of a set S_j is defined, for each sample i, as the expression value of the gene in S_j closest in Euclidean distance to the centroid,

m_{ji} = x_{k_i^* i}, \quad \text{where } k_i^* = \arg\min_{k \in S_j} (x_{ki} - c_{ji})^2, \qquad (4.5)

where x_{ki} is the expression level of the kth gene (out of the |S_j| genes in the set) in the ith sample. Therefore, m_j = (m_{j1}, ..., m_{jN})^T is the vector of medoid values for the jth gene set. Note that this is an element-wise medoid, computed for each sample separately, and is not the same as an overall medoid over all samples,

m'_j = x_{k^*}, \quad \text{where } k^* = \arg\min_{k \in S_j} \|x_k - c_j\|_2^2. \qquad (4.6)

The medoid in Eqn. 4.6 is computed over all samples and is therefore the expression vector of one of the genes in the set, whereas for the sample-wise medoid (Eqn. 4.5) each element of the vector may originate from a different gene. The medoid over all samples in Eqn. 4.6 is denoted set medoid2 in the results.
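The three location-based set statistics above (Eqns. 4.3 to 4.5) can be sketched in a few lines of R; here X is the p × N expression matrix, S is a vector of row indices for one gene set, and the function names are illustrative only.

    set_centroid <- function(X, S) colMeans(X[S, , drop = FALSE])          # Eqn. 4.3
    set_median   <- function(X, S) apply(X[S, , drop = FALSE], 2, median)  # Eqn. 4.4

    # Element-wise medoid (Eqn. 4.5): for each sample, the expression value of the
    # set gene that is closest to the set centroid in that sample.
    set_medoid <- function(X, S) {
      sub <- X[S, , drop = FALSE]
      cj  <- colMeans(sub)
      sapply(seq_len(ncol(sub)), function(i) sub[which.min((sub[, i] - cj[i])^2), i])
    }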


Set t-statistic

The set centroid does not take into account different means and variances between the genes, nor the fact that a gene may have a large mean but also high variance (a low signal-to-noise ratio). An alternative is to use the one-sample t-statistic. The matrix of t-statistics is computed by first centring and scaling the expression matrix so that each gene has mean zero and unit variance, and then computing the t-statistic for each set in each sample,

t_{ij} = c_{ij} \sqrt{|S_j|} / \mathrm{sd}_{ij}, \qquad (4.7)

where c_{ij} is the ith coordinate of the jth centroid statistic from Eqn. 4.3, c_j = [c_{ij}]_{1 \le i \le N}, and sd_{ij} is the standard deviation of the genes in set j in the ith sample. Scaling is done to prevent spurious t-statistics, arising from very small variances, from inflating the importance of "non-interesting" genes. We also excluded sets with fewer than 30 genes for the same reason.

Set U-statistic

The competitive U-statistic for the set, also known as Wilcoxon's rank-sum statistic (Lehmann, 1975), compares the mean rank of the genes in the set with the mean rank of the genes outside the set, for each sample. We define s_j = |S_j| and s_{\neg j} = |\bigcup_{l=1}^{M} \{x \in S_l \mid x \notin S_j\}| as the numbers of genes in and out of the jth set, respectively (note that gene sets may overlap). The U-statistic is computed as follows:
1. Create the list of gene expression ranks L_i = l_{i1}, l_{i2}, ..., l_{i s_j} of the s_j genes in the set in the ith sample.
2. Sum the ranks for the set: R_{ij} = \sum_{k=1}^{s_j} l_{ik}.
3. The U-statistic for set S_j in sample i is then

U_{ij} = R_{ij} - s_j (s_j + 1)/2. \qquad (4.8)

For large numbers of genes in and out of the set, the U-statistic is approximately normally distributed with mean \mu = s_j s_{\neg j}/2 and variance \sigma^2 = s_j s_{\neg j} (s_j + s_{\neg j} + 1)/12. Once the U-statistic is computed, we use the log p-value from this normal approximation as the feature for the classifier.

The U-statistic is slightly unusual in that it pits gene sets against each other: the distribution depends on the number of genes rather than the number of samples. Goeman and Bühlmann (2007) argue that this statistic is inappropriate since it switches the standard relationship between genes and samples in the experimental setup (the sample size becomes the number of genes, not the number of microarrays); however, Barry et al. (2008) consider it a useful statistic nonetheless. In any case, we use this statistic only as a feature for a classifier, and not for making inferences about the statistical significance of the sets' expression levels.
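The set t-statistic (Eqn. 4.7) and the log p-value of the set U-statistic (Eqn. 4.8, under the normal approximation) can be sketched as follows. To keep the sketch self-contained, the number of genes outside the set is simplified to all remaining rows of X rather than the union over the other sets, the upper-tail p-value is used, and all names are illustrative.

    set_t_stat <- function(X, S) {
      Xs  <- t(scale(t(X)))                    # centre and scale each gene (row)
      sub <- Xs[S, , drop = FALSE]
      colMeans(sub) * sqrt(length(S)) / apply(sub, 2, sd)   # Eqn. 4.7, one value per sample
    }

    set_u_stat_logp <- function(X, S) {
      rk    <- apply(X, 2, rank)               # rank all genes within each sample
      s_in  <- length(S)
      s_out <- nrow(X) - s_in                  # simplification of s_{not j}
      U     <- colSums(rk[S, , drop = FALSE]) - s_in * (s_in + 1) / 2   # Eqn. 4.8
      mu    <- s_in * s_out / 2
      sigma <- sqrt(s_in * s_out * (s_in + s_out + 1) / 12)
      pnorm(U, mean = mu, sd = sigma, lower.tail = FALSE, log.p = TRUE) # log p-value feature
    }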


Statistic and equation: set centroid (4.3); set median (4.4); set medoid (4.5), (4.6); set t-statistic (4.7); set U-statistic (4.8); first principal component of the set (4.9).
Table 4.2.: The gene set statistics used in this work.

First Principal Component of the Set

Principal Component Analysis (PCA) (Hastie et al., 2009a; Ramsay and Silverman, 2006) is performed using the singular value decomposition (SVD) of the gene set's |S_j| × N expression matrix X_{S_j} = [x_{ki}]_{k \in S_j, 1 \le i \le N}, defined as

X_{S_j} = V_{S_j} D_{S_j} U_{S_j}^T,

where V_{S_j} and U_{S_j} are matrices whose columns are the left and right singular vectors, respectively, and D_{S_j} is a diagonal matrix whose entries are the singular values (the diagonal of D_{S_j}^2 contains the eigenvalues of X_{S_j} X_{S_j}^T). The first eigenvector v_1 (the first column of V_{S_j}) explains the largest amount of variance in X_{S_j}. The first principal component PC1_j ∈ R^N of the X_{S_j} matrix is obtained by projecting the data onto that eigenvector,

\mathrm{PC1}_j = v_1^T X_{S_j}, \qquad (4.9)

where v_1 is an s_j-vector. Hence, PC1 is the best rank-1 approximation of the data. We mean-centred and scaled the matrix X_{S_j} to unit variance, gene by gene, before applying PCA, in order to put all genes on the same scale, reducing the effect of genes with larger than usual variance on the singular vectors.

One possible problem with PCA is axis reflection (Mehlman et al., 1995). Since the eigenvectors of the covariance matrix are determined only up to a multiplicative constant, different numerical implementations of PCA may produce eigenvectors of opposite signs. Furthermore, even small perturbations of the same data (such as through bootstrap replications) may yield flipped signs. This effect is amplified in the presence of noise. When the components are used as features for classification or regression, flipped signs result in flipped estimates of the model parameters. Since there are usually differences between datasets, axis reflection especially manifests itself as negative correlation between the eigenvectors derived from different datasets. Eigenvectors from different datasets may also point in opposite directions because a gene set may change the sign of its correlation with the phenotype under different experimental conditions; we assume this is not the case here.

To mitigate the effects of axis reflection, the sign of the eigenvectors must be (arbitrarily) fixed. Since we are interested only in the first principal component, a simple heuristic solution is to fix the sign of the first eigenvector, by flipping the sign of its elements if the majority of the signs are negative,

v'_1 = v_1 \, \mathrm{sign} \langle v_1, \mathbf{1} \rangle,

where \mathbf{1} = (1, ..., 1)^T is a vector of ones of the same length as v_1.
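A short R sketch of the first-principal-component statistic (Eqn. 4.9) together with the sign-fixing heuristic just described; X_S is the |S_j| × N submatrix for one set, and the function name is illustrative.

    set_pc1 <- function(X_S) {
      Xs <- t(scale(t(X_S)))              # gene-wise centring and scaling
      v1 <- svd(Xs)$u[, 1]                # first left singular vector (length |S_j|)
      if (sum(v1) < 0) v1 <- -v1          # fix the sign so that <v1, 1> >= 0
      drop(crossprod(v1, Xs))             # PC1_j = v1^T X_Sj  (Eqn. 4.9)
    }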


Note that, even after this correction, axis reflections might still occur, since some elements of the eigenvector may still flip their sign if they are close to zero. A second caveat with PCA is that although it finds the principal component that explains the most variance in the predictor variables, this component may or may not explain the variance in the response variable. A third and final caveat is that although PCA is intended to reduce the effects of noise on the data, it can itself be sensitive to noise and outliers. For example, while most of the data may lie along one direction (suggesting this direction as a good principal component), adding a few large outliers orthogonal to this direction may result in a different (orthogonal) principal component being chosen.

Other PCA variants have been proposed, for example smoothed or penalised principal components (Ramsay and Silverman, 2006, Ch. 9) and supervised PCA (Bair and Tibshirani, 2004). We have not implemented these here, since standard PCA is more common in the literature and the more sophisticated methods require further tuning.

4.3. Results

We conducted experiments aimed at evaluating the utility of classifiers based on gene set statistics, relative to the individual-gene approach. First, we assessed the ability to discriminate good from poor prognosis in the breast cancer datasets. Second, we investigated the prognostic lists derived from these statistics, in comparison with lists derived from individual genes, and measured their stability across random perturbations of the data. Third, we mapped the prognostic lists to known biological pathways in order to find those pathways most strongly associated with metastasis, both in the data as a whole and in subsets of the data defined by breast cancer molecular subtypes.

4.3.1. Discrimination of Distant Metastasis

Figure 4.3 shows the AUC for external validation, trained on one dataset and tested on another, a total of 2 × \binom{5}{2} = 20 predictions (the procedure is not symmetric), using centroid classifiers trained on different numbers of features. The maximum number of features is 22,215 for genes and 5414 for gene sets. For clarity, we only show the results for classifiers based on the expression of individual genes (denoted "raw"), the set centroid, the set median, and the set t-statistic. Unlike classifiers such as logistic regression or SVMs, the centroid classifier's weight for one feature does not depend on the others. While it is known that genes are not independently expressed, this strong assumption does not appear to reduce classification accuracy in our data. The best AUC of about 0.7 is consistent with previous results based on either lists of individual genes (van 't Veer et al., 2002; Wang et al., 2005) or gene sets (van Vliet et al., 2007).


[Figure 4.3: AUC (y-axis, roughly 0.60 to 0.70) against the number of features (x-axis, 1 to 22,215) for the raw genes and the set centroid, set median, set medoid, set medoid2, and set t-statistic features.]
Figure 4.3.: Average and 95% confidence intervals for AUC from external validation between the five datasets, n = 2 × \binom{5}{2} = 20 (train, test) pairs, for different numbers of features. We show only every second confidence interval for clarity. Note that each dataset ranks its features independently; hence, the kth feature is not necessarily the same across datasets. Individual genes are denoted raw.


The set centroid, both set medoids, the set median, and the set t-statistic showed similar or just slightly lower AUC than that of individual genes. The set PC and set U-statistic showed statistically significant reductions in AUC compared with individual genes (Figure A.6).

While the AUC does not seem to improve, on average, by using set statistics rather than individual genes, Figure 4.4 shows that the variance of the AUC is lower for the set t-statistic than for individual genes. This observation is consistent with the expectation that creating new features by averaging over individual genes reduces the variance of the input and consequently that of the classifier's prediction.

Further, we observed that the discrimination of good/poor metastasis outcome from the gene set statistics was similar in the internal and external validation, indicating that the centroid classifier did not significantly over- or under-fit the data, and that the AUC estimates from internal validation, which are easier to obtain since only one dataset is required, are representative of those from external validation, which is more complicated since at least two datasets are required.

4.3.2. Stability of Feature Ranks

We were interested in how the ranks of a single feature vary, since we prefer features that are highly ranked on average and have small variability about that average. If a feature has a low average rank and large variability, it may sometimes appear at the top of the list simply by chance when the experiment is repeated, indicating that it is not a reliable predictor. On the other hand, features with a high average rank and large variability may still be good predictors on average but will create unstable gene lists, manifesting as different datasets producing different gene lists of similar predictive ability.

To evaluate the variability of the ranks, we used the percentile bootstrap to sample the observations with replacement, generating a bootstrap distribution for the centroid weights of genes and gene sets in one dataset (GSE4922). Since there are 22,215 genes and only 5414 gene sets, a reduced gene list was produced by training a centroid classifier on the GSE11121 dataset and selecting the top 5414 genes based on their absolute centroid weights |w_j| (Eqn. 4.2); the gene list was fixed across the bootstrap replications.

In many cases we are interested in a small signature comprised of the most useful or predictive features. Therefore, we selected the top 15 genes and gene sets based on their mean rank. Figure 4.5 shows the mean, 2.5%, and 97.5% percentiles from 5000 bootstrap replications for these top features (shown from highest to lowest), using the set centroid statistic. It is clear that the top gene sets have lower variation than the top genes. In light of these results, it is not surprising that lists of prognostic genes show little overlap, as even the best-ranked genes vary considerably within the same dataset, let alone between datasets; gene-set features are more stable.
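A sketch of the bootstrap procedure behind Figure 4.5, assuming the illustrative centroid_train function from Section 4.2.3; Z is the feature-by-sample matrix (genes or set statistics) of one dataset, and the names are ours.

    # Bootstrap the centroid weights and record each feature's rank in every replicate.
    bootstrap_ranks <- function(Z, y, B = 5000) {
      ranks <- replicate(B, {
        idx <- sample(ncol(Z), replace = TRUE)                # percentile bootstrap resample
        fit <- centroid_train(Z[, idx, drop = FALSE], y[idx])
        rank(-abs(fit$w))                                     # rank 1 = largest |w|
      })
      # mean rank and 2.5%/97.5% percentiles per feature, as plotted in Figure 4.5
      data.frame(mean  = rowMeans(ranks),
                 lower = apply(ranks, 1, quantile, probs = 0.025),
                 upper = apply(ranks, 1, quantile, probs = 0.975))
    }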


[Figure 4.4: Var(AUC) (y-axis, roughly 0.002 to 0.012) against the number of features (x-axis, 1 to 22,215) for the raw genes and the set centroid, set median, set medoid, set medoid2, and set t-statistic features.]
Figure 4.4.: Variance and 95% confidence intervals of the AUC from external validation between the five datasets, n = 2 × \binom{5}{2} = 20 (train, test) pairs, for different numbers of features. The confidence intervals are [(n−1)s²/χ²_{α/2,n−1}, (n−1)s²/χ²_{1−α/2,n−1}], where χ²_α is the α = 0.05 quantile of a chi-squared distribution with n − 1 degrees of freedom, and s² is the sample variance.


[Figure 4.5: bootstrap rank distributions (ranks 0 to about 2000) for the top 15 genes (including MYBL2, SORL1, DUS2L, RP6-213H19.1, NCAPH, RARRES1, CD302, SLC44A4, NCAPG, BUB1B, TFRC, C4orf18, FAM38A, POLQ, and SNRPA1) and the top 15 gene sets (including chr4p, DAC_FIBRO_DN, GNF2_MKI67, chr1q11, MIDDLEAGE_DN, GNF2_CCNA2, GNF2_HMMR, GNF2_CDC20, GNF2_PCNA, GNF2_RRM2, GNF2_TTK, GNF2_H2AFX, ZHAN_MM_CD138_PR_VS_REST, P21_EARLY_DN, and GNF2_CDC2).]
Figure 4.5.: Mean and 2.5%/97.5% percentiles of the ranks of genes and gene sets. Ranks are based on the weight assigned by the centroid classifier to each feature. For gene sets, we used the set centroid statistic. The process was repeated over 5000 bootstrap replications of the GSE4922 dataset. Features have been sorted by their mean rank.


[Figure 4.6: correlation values (roughly 0.0 to 0.6) for the raw genes and each set statistic (set centroid, set median, set medoid, set PC, set t-statistic, and set U-statistic log p-value).]
Figure 4.6.: Spearman rank-correlation of the centroid classifier's weights from the five datasets (n = 10 comparisons). Individual genes are denoted raw.

4.3.3. Concordance between Datasets

We were interested in measuring how the different datasets agreed on the importance of the features (genes or gene sets). We used two approaches: rank correlation of the centroid classifier's weights, and concordance of the feature lists. For this section, the classifier was not bagged; we trained a single classifier on each dataset. We note that each dataset was independently normalised, and we are interested in agreement between datasets despite the fact that there may be some differences between them that were not captured by the predictive model, such as unknown batch effects or other confounders. A high level of agreement between independent models is a strong indication that the predictive ability of the models is due to true biological signal and not due to confounding.

We measured concordance between the classifier weights estimated from each dataset using Spearman rank-correlation, a total of \binom{5}{2} = 10 comparisons (comparisons are symmetric), as shown in Figure 4.6. It is evident that the rank-correlations for the weights of the set centroid, set median, set medoid, and set t-statistic are higher than for individual genes. This indicates that classifiers built from features based on gene sets are more stable than those built using individual genes, and are less likely to overfit. The set U-statistic showed the lowest concordance of all measures considered, including individual genes.
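The between-dataset concordance shown in Figure 4.6 can be computed along the following lines; W is assumed to be a list of the five weight vectors, one per dataset, over a common set of features, and the function name is illustrative.

    # All 10 pairwise Spearman rank-correlations between the datasets' weight vectors.
    pairwise_spearman <- function(W) {
      pairs <- combn(length(W), 2)
      apply(pairs, 2, function(p) cor(W[[p[1]]], W[[p[2]]], method = "spearman"))
    }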


To measure how the ranked lists produced by each dataset agreed on the top-ranked genes, we used the following approach. The features of each dataset were ranked by the absolute value of their weight w. However, list lengths can affect the apparent concordance between datasets in a subtle way: lists with smaller numbers of items to rank will achieve higher concordance on average than longer lists, simply by virtue of having fewer items to choose from. Therefore, we ensured that the number of items being ranked was identical across all tasks: we used the 4120 gene sets selected by the set t-statistic as the basis for the lists of all other gene set statistics, and for the genes we selected 4120 genes in one dataset and used the same list across the other four gene datasets. Then, for each number of features f, f = 1, ..., p, we chose each dataset's top f ranked features. Next, we counted how many of these f features occurred in at least k of the five datasets. Results for k = 5 are shown in Figure 4.7. Lists based on individual genes show no overlap whatsoever for cutoffs up to about 70 (there are no genes that occur in all five lists of length 70) and very low overlap even at 200 genes. In comparison, the set statistics, especially the set medians and the set centroids, produce lists with higher overlap, even at cutoffs below f = 10. This result further supports the conclusion that lists of individual prognostic genes are highly unstable, even when developed on the same dataset (Ein-Dor et al., 2005; Michiels et al., 2005). In contrast, aggregation into set-based features greatly increases the stability and hence the interpretability of the results.
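A sketch of the list-overlap computation underlying Figure 4.7; ranked_lists is assumed to be a list of five named integer vectors giving each dataset's rank for the same items, and the names are illustrative.

    # For each cutoff f, count how many features appear in the top-f list of at least k datasets.
    overlap_counts <- function(ranked_lists, f_max = 200, k = 5) {
      sapply(seq_len(f_max), function(f) {
        top <- lapply(ranked_lists, function(r) names(r)[r <= f])   # items ranked above the cutoff
        sum(table(unlist(top)) >= k)
      })
    }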


4.3.4. Analysis of Predictive MSigDB Sets

Here we analyse which MSigDB sets were highly predictive of distant metastasis, using the centroid classifier, and examine which genes are over-represented in these sets. Table 4.3 shows the top 10 gene sets by rank, where the rank was averaged over the feature ranks from the five datasets, using the set centroid statistic. Also shown is the enrichment for GO Biological Process (BP) terms, from a Bonferroni-adjusted Fisher's exact test, for the genes belonging to these sets. The top sets are enriched for GO BP terms related to the cell cycle and cell division processes, and for the PI3K pathway, which interacts with the Ras oncogene (Downward, 2003), confirming the cell cycle as one of the major biological mechanisms associated with breast cancer metastasis (Dai et al., 2005; Mosley and Keri, 2008). The top gene set, GNF2_MKI67, represents the neighbourhood of Ki-67, a well-known clinical marker of cancer proliferation (van Diest et al., 2004).

Table 4.3 entries (rank; set; MSigDB category; sign; MSigDB description; enriched GO BP terms with adjusted p-values):
1. GNF2_MKI67 (C4, sign −1), neighborhood of MKI67: "phosphoinositide-mediated signaling" 1.95×10^-10, "spindle organization" 5.86×10^-6, "establishment of mitotic spindle localization" 1.10×10^-5, "kinetochore assembly" 5.48×10^-5, "mitotic chromosome condensation" 1.37×10^-4, "protein complex localization" 2.55×10^-3, "regulation of striated muscle development" 2.55×10^-3, "metaphase plate congression" 2.55×10^-3.
2. GNF2_CCNA2 (C4, sign −1), neighborhood of CCNA2: "phosphoinositide-mediated signaling" 4.05×10^-16, "DNA replication" 1.04×10^-9, "mitotic chromosome condensation" 1.32×10^-8, "regulation of striated muscle development" 3.76×10^-3, "metaphase plate congression" 3.76×10^-3.
3. GNF2_TTK (C4, sign −1), neighborhood of TTK: "phosphoinositide-mediated signaling" < 2.22×10^-16, "mitotic chromosome condensation" 4.35×10^-14, "DNA replication" 1.01×10^-12, "spindle organization" 1.37×10^-9, "establishment of mitotic spindle localization" 9.59×10^-5, "kinetochore assembly" 4.76×10^-4, "DNA repair" 5.78×10^-3, "mitosis" 9.44×10^-3.
4. GNF2_HMMR (C4, sign −1), neighborhood of HMMR: "phosphoinositide-mediated signaling" < 2.22×10^-16, "mitotic cell cycle spindle assembly checkpoint" 1.26×10^-11, "spindle organization" 4.89×10^-10, "mitotic chromosome condensation" 8.46×10^-8, "cell proliferation" 6.22×10^-6, "DNA replication" 1.09×10^-5, "establishment of mitotic spindle localization" 5.33×10^-5, "kinetochore assembly" 2.65×10^-4, "protein complex localization" 8.29×10^-3, "regulation of striated muscle development" 8.29×10^-3, "metaphase plate congression" 8.29×10^-3.
5. GNF2_CDC20 (C4, sign −1), neighborhood of CDC20: "phosphoinositide-mediated signaling" < 2.22×10^-16, "spindle organization" 2.20×10^-12, "mitotic cell cycle spindle assembly checkpoint" 4.07×10^-11, "mitotic chromosome condensation" 1.52×10^-9, "cell proliferation" 8.96×10^-9, "mitosis" 1.83×10^-8, "establishment of mitotic spindle localization" 8.95×10^-5, "kinetochore assembly" 4.45×10^-4, "DNA replication" 7.83×10^-3.


6. GNF2_SMC2L1 (C4, sign −1), neighborhood of SMC2L1: "mitotic cell cycle spindle assembly checkpoint" 5.15×10^-13, "mitotic chromosome condensation" 7.16×10^-9, "phosphoinositide-mediated signaling" 2.14×10^-6, "establishment of mitotic spindle localization" 1.31×10^-5, "kinetochore assembly" 6.51×10^-5, "protein complex localization" 2.90×10^-3, "DNA strand elongation during DNA replication" 2.90×10^-3, "regulation of striated muscle development" 2.90×10^-3, "metaphase plate congression" 2.90×10^-3, "cell proliferation" 2.94×10^-3, "nucleotide-excision repair, DNA gap filling" 3.56×10^-3.
7. GNF2_H2AFX (C4, sign −1), neighborhood of H2AFX: "cell proliferation" 9.28×10^-10, "phosphoinositide-mediated signaling" 5.54×10^-7, "mitosis" 8.48×10^-5, "mitotic cell cycle spindle assembly checkpoint" 1.33×10^-4, "protein complex localization" 1.63×10^-3.
8. GNF2_ESPL1 (C4, sign −1), neighborhood of ESPL1: "phosphoinositide-mediated signaling" 5.38×10^-11, "kinetochore assembly" 3.12×10^-5, "mitotic chromosome condensation" 6.75×10^-5, "spindle organization" 7.76×10^-4, "protein complex localization" 1.67×10^-3, "regulation of striated muscle development" 1.67×10^-3, "metaphase plate congression" 1.67×10^-3.
9. GNF2_RRM2 (C4, sign −1), neighborhood of RRM2: "phosphoinositide-mediated signaling" 4.52×10^-15, "mitotic cell cycle spindle assembly checkpoint" 1.17×10^-9, "spindle organization" 1.20×10^-7, "DNA replication" 5.42×10^-6, "cell proliferation" 1.97×10^-5, "establishment of mitotic spindle localization" 4.09×10^-5, "kinetochore assembly" 2.03×10^-4, "protein complex localization" 6.80×10^-3, "regulation of striated muscle development" 6.80×10^-3, "metaphase plate congression" 6.80×10^-3.
10. GNF2_PCNA (C4, sign −1), neighborhood of PCNA: "phosphoinositide-mediated signaling" < 2.22×10^-16, "DNA replication" 1.47×10^-15, "mitotic chromosome condensation" 2.36×10^-7, "spindle organization" 4.33×10^-7, "establishment of mitotic spindle localization" 9.59×10^-5, "cell proliferation" 4.18×10^-4, "DNA repair" 4.33×10^-4, "kinetochore assembly" 4.76×10^-4, "mitosis" 9.44×10^-3.
Table 4.3.: Top 10 gene sets by average rank over the five datasets, using the set centroid statistic. GO enrichment p-values are from a Bonferroni-adjusted one-sided Fisher's exact test (30,330 tests). Sign = −1 if expression is negatively associated with long-term survival, and +1 otherwise. The background list for the test includes all Affymetrix HG-U133A probesets that could be mapped to GO BP terms, excluding IEA annotations.


[Figure 4.7: "Features in ≥ 5 datasets"; number of common features (y-axis, 0 to 100) against the threshold f (x-axis, 0 to 200) for the raw genes and each set statistic.]
Figure 4.7.: Concordance of feature lists (genes or gene sets) for different cutoffs f = 1, ..., 200, counting the number of features occurring in all five datasets' lists, ranked higher than f. We use raw to denote individual genes. Prior to ranking, we selected 4120 genes (for the raw lists) or gene sets (for the set statistics) to be ranked, so that the number of unique items was identical across all lists.


The potential advantage of gene set signatures over individual-gene signatures depends on the degree of coexpression of the genes within each set. A critical aspect of this performance, therefore, is the source of the grouping of genes into sets. The MSigDB is composed of five set classes, depending on the annotation used to define the sets. Whereas categories C1 and C3 are derived from chromosomal locations and the sequence of regulatory elements, respectively, categories C2 and C4 both originate from pathway information and expression profiles related to cancer; C5 is based on GO categories. In contrast with the other four categories, GO sets do not necessarily define co-expressed or co-regulated genes. Hence, it may not be meaningful to form set statistics over some of these sets. In addition, the datasets these categories are based on vary with respect to sample size; whereas C4 was based on hypothesis-free examination of co-expression across almost two thousand expression profiles, C2 is based mainly on published expression profiles, rarely using more than dozens of samples.

To see whether different MSigDB categories were more useful for predicting metastasis, we combined four datasets (GSE2034, GSE4922, GSE6532, and GSE7390) into a single training set. A separate centroid classifier was trained on each gene set, using the set centroid statistic, and the gene sets were then ranked by the centroid classifier weights (negative to positive). We then tested the classifiers on the remaining dataset, GSE11121. Finally, we used the two-sample Kolmogorov-Smirnov statistic to compare the ranks from the different categories.

Gene Set Enrichment Analysis (GSEA) (Mootha et al., 2003; Subramanian et al., 2005) uses the counting formulation of the two-sided two-sample Kolmogorov-Smirnov statistic (Hollander and Wolfe, 1999, p. 182) to quantify the distribution of genes belonging to some gene set relative to genes not in this set. Note that this statistic does not quantify the deviation from uniform randomness (which would require the one-sample Kolmogorov-Smirnov test), but the deviation of sets from each other. Equivalently, we used the two-sided two-sample Kolmogorov-Smirnov statistic to test for enrichment of sets belonging to a given MSigDB category (C1, C2, C3, C4, C5).

First we define the form of the Kolmogorov-Smirnov statistic used here. Let F(t) and G(t) be the cdfs (cumulative distribution functions) of the two continuous random variables X and Y. The null and alternative hypotheses are, respectively,

H_0: F(t) = G(t) \text{ for all } t, \qquad H_A: F(t) \neq G(t) \text{ for at least one } t. \qquad (4.10)

The two-sided two-sample Kolmogorov-Smirnov statistic is

K = \sup_t |F_n(t) - G_m(t)|, \qquad (4.11)

where F_n(t) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le t) and G_m(t) = \frac{1}{m} \sum_{i=1}^{m} I(y_i \le t) are the empirical cdfs of the two samples (n samples from X and m samples from Y), respectively, and I(\cdot) is the indicator function (1 if the argument is true and 0 otherwise).


4.3. Resultsfunction (1 if the argument is true and 0 otherwise). K is computed asK = maxi=1,...,N ∣F n(z i ) − G m (z i )∣ , (4.12)where z are the combined samples (x 1 , . . . , x n , y 1 , . . . , y m ), that have been ordered in ascendingorder, such that z 1 ≤ z 2 ≤ ⋯ ≤ z N , N = m + n. Our <strong>for</strong>mulation here differs from (Hollanderand Wolfe, 1999, pp. 178–179) in that we do not multiply K by mnd .Under the assumption that X and Y are continuous random variables, there are no tiesbetween F n (t) and G m (t), there<strong>for</strong>e at each t, the difference F n (t)−G m (t) can either increaseby 1/n or decrease by 1/m, but not both. Hence, K can also be calculated using a cumulativesum SK = max ∣S∣ , (4.13)i=1,...,NwhereS j = S j−1 + δ j , δ j =⎧⎪⎨⎪⎩+1/n if z j is from X,−1/m if z j is from Y .j = 1, . . . , N, (4.14)and S 0 = 0.In GSEA, the cumulative sum S is plotted to show the relative location <strong>of</strong> each gene set.Similarly, we plot S to show the location <strong>of</strong> the MSigDB categories in the ranked sets —<strong>for</strong> each category C k , k ∈ {1, 2, 3, 4, 5}, we take X to represent the weights <strong>of</strong> the sets fromC k (weights are averages over the five datasets), and Y to represent the weights <strong>of</strong> the setsoutside the category, that is, C {1,2,3,4,5}∖k . The cumulative sum S k is then computed <strong>for</strong> eachcategory C k .Kolmogorov-Smirnov p-values are conservative (larger) in the presence <strong>of</strong> ties (Pratt andGibbons, 1981, pp. 330–331), hence we do not correct <strong>for</strong> tied ranks. The p-values were basedon the two-sample two-sided test, using ks.test in the R statistical package (R DevelopmentCore Team, 2011).Figure 4.8 shows the cumulative-sum statistic, from which the Kolmogorov-Smirnov statisticis computed, <strong>for</strong> the ranked gene lists. In order to link that list with per<strong>for</strong>mance insample classification, we plotted the centroid classifier’s AUC value <strong>for</strong> each <strong>of</strong> these setsalong the rank (considering only one set at a time). The results show that the C4 sets tendto have more extreme centroid weights, especially towards the negative side, than the othercategories. In contrast, C2 weights show a concentration towards the positive weights, albeitmuch smaller. Category C3 tends to be concentrated in the middle ranks, and category C1tends to be concentrated in the negative to middle ranks. Finally, category C5 is distributedmore uni<strong>for</strong>mly across the ranks.These results show that the highly-predictive sets tend to be C4 sets, and to a lesser extentC2. There is no enrichment <strong>for</strong> C1, C3, and C5 sets, showing that these sets are, as awhole, not useful <strong>for</strong> breast cancer metastasis prediction. For C5, this may be since GO83
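As an illustration of the statistic defined in Eqns. 4.11–4.14 above, the following R sketch (using simulated data, not the thesis datasets; the helper name ks_cumsum is ours) computes the two-sample Kolmogorov-Smirnov statistic via the cumulative-sum formulation and checks it against the built-in ks.test:

# Illustrative sketch (simulated data): the two-sample KS statistic of Eqns. 4.11-4.14,
# computed via the cumulative sum S and checked against base R's ks.test.
ks_cumsum <- function(x, y) {
  n <- length(x)
  m <- length(y)
  z <- c(x, y)
  from_x <- c(rep(TRUE, n), rep(FALSE, m))[order(z)]  # origin of each ordered z_j
  delta <- ifelse(from_x, 1 / n, -1 / m)              # +1/n if from X, -1/m if from Y
  S <- cumsum(delta)                                  # S_j = S_{j-1} + delta_j, S_0 = 0
  max(abs(S))                                         # K = max_j |S_j|
}

set.seed(1)
x <- rnorm(50)
y <- rnorm(70, mean = 0.5)
ks_cumsum(x, y)
unname(ks.test(x, y)$statistic)  # agrees (up to floating point) when there are no ties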


Figure 4.8.: Kolmogorov-Smirnov enrichment for MSigDB categories, using the set-centroid statistic. (A) AUC and spline smooth for each set, tested on GSE11121. (B) Number of mapped probesets in each set, on log2 scale, and spline smooth. (C) Two-sample Kolmogorov-Smirnov Brownian bridge for each MSigDB category (p-values: C1: 1.44×10^−4, C2: 3.55×10^−15, C3: < 2.22×10^−16, C4: 4.22×10^−13, C5: 2.38×10^−2).


For C5, this may be because GO sets do not take the direction of expression change into account, potentially leading to sets composed of genes with a mixture of positive and negative correlations, which may cancel each other out when averaged over. There was no positive enrichment for the C1 and C3 sets, which represent positional gene sets and motif gene sets, respectively; thus there is no evidence that chromosomal aberrations (for C1) or changes in cis-regulation (for C3) are significant drivers of breast cancer metastasis. However, the coverage of these gene sets is probably too limited to conclusively preclude these categories as factors in metastasis.

One possible problem with the set-centroid statistic is that, for small sets, there is a higher probability of observing a spurious extreme statistic, since the variance of the sample mean decreases with set size, and we are considering potentially thousands of such sets. This implies that spurious set centroids (high absolute value) would be more common in smaller sets, leading to a bias towards smaller sets when ranking them. However, there does not seem to be a monotonic relationship between the log-size and rank (Figure 4.9). Additionally, there is reasonable concordance between the sets as independently ranked by the five datasets. We conclude that, while spurious effects due to set size cannot be ruled out, they do not seem to be a major factor in a set's rank. When such effects are of concern, an alternative to the set centroid can be used, such as the set t-statistic, which corrects for differences in set sizes and set variances.

Class | < 5 years | ≥ 5 years | Total
1 ER−/HER2− | 35 | 80 | 115
2 ER+/HER2− | 107 | 423 | 530
3 HER2+ | 55 | 164 | 219

Table 4.4.: Breakdown of samples for each cancer subtype.

4.3.5. Prognostic Gene Sets in Breast Cancer Molecular Subtypes

Breast cancer is a heterogeneous disease, with gene expression segregating the cases into different biologically and clinically relevant subtypes, potentially implying differing biological mechanisms for tumour growth and progression, and suggesting separate cells of origin (Perou et al., 1999; Sotiriou and Piccart, 2007; Sotiriou and Pusztai, 2009). Several molecular subtype classifications have been proposed, the most well known of which is the Stanford "intrinsic" classification (Perou et al., 2000; Sørlie et al., 2001, 2003) (named after the "intrinsic" subset of highly differentially-expressed genes on which the classification is based), which used hierarchical clustering of gene expression data to define five molecular subtypes: basal-like, ERBB2+ (also called HER2+), normal-breast-like, and luminal A and B (some definitions include the luminal C type as well).


Figure 4.9.: AUC and weight versus set size for the set centroid statistic, using the centroid classifier.


These classifications have important clinical implications; for example, HER2+ cases can be treated with trastuzumab (commercially known as Herceptin), whereas ER+ (estrogen-receptor positive) cases should be treated with tamoxifen. The basal-like subtype, mainly defined by a set of intermediate filament genes (keratins) that histologically stain cells with spindle form and that are located further from the milk ducts, represents a collection of cancer cases that are harder to treat. However, there have been concerns about the stability and robustness of subtypes defined from data, especially from the relatively small datasets originally used (Pusztai et al., 2006). In addition, the definition of subtypes does not necessarily correspond to clinical outcomes such as distant metastasis, since these phenotypes were not taken into account in the clustering procedure.

The basal-like subtype largely corresponds to the "triple-negative" breast cancers, so named because they are characterised by being ER−/PR−/HER2− (estrogen receptor negative, progesterone receptor negative, and HER2 negative). Traditionally, ER and PR status has been determined using immunohistochemical assays (protein staining of tissue). Consequently, it has been suggested that a more clinically relevant definition of the subtypes would be based on gene expression (mRNA levels) rather than immunohistochemistry (Desmedt et al., 2008; Loi et al., 2007; Wirapati et al., 2008), directly taking into account information about clinical outcomes, thus defining three molecular subtypes: ER−/HER2−, ER+/HER2−, and HER2+. The ER−/HER2− subtype roughly corresponds to the intrinsic triple-negative class, but excludes PR status, since its overall mRNA level in breast tissue is not high and the role of progesterone-receptor status in defining molecular subtypes is currently unclear. The ER+/HER2− subtype roughly corresponds to the intrinsic luminal A/B classes, and the HER2+ subtype is the same across the two classification systems.

Our results above show a strong cell-cycle signature as highly prognostic of distant metastasis, supporting existing findings (Desmedt et al., 2008). The association of cell-cycle genes with increased risk of metastasis has been mainly attributed to the breast cancer cases that are ER+ (estrogen-receptor positive) (Buyse et al., 2006; Loi et al., 2007), which comprise the majority of the breast cancer population. To classify the samples in our data into their molecular subtypes, we followed the procedure described by Desmedt et al. (2008) and assessed the list of gene modules, which are intended to represent different biological functions such as tumour invasion, immune response, angiogenesis, apoptosis, proliferation, and ER and HER2 signalling. We clustered the samples based on their ER and HER2 module scores (a three-component Gaussian mixture model with diagonal covariance, using the R package flexmix (Leisch, 2004)) into the three molecular subtypes: ER−/HER2−, ER+/HER2−, and HER2+ (Figure 4.10). The number of cases in each subtype is shown in Table 4.4. Subsequently, we reran our analysis, which consists of training the centroid classifier on the MSigDB set statistics, on each subgroup. Table 4.5 shows the top gene sets for each subgroup for the set centroid statistic. The set centroid, set medoid, and set median show enrichment for genes from the AURKA module in the ER+/HER2− subtype, as expected, and to a lesser extent an immune response signature (STAT1 module) in the ER−/HER2− subtype, manifesting as IFN-γ-related sets.
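As a rough illustration of this subtype-assignment step, the following R sketch fits a three-component diagonal-covariance Gaussian mixture to simulated ER/HER2 module scores; it uses the mclust package rather than the flexmix call used in the actual analysis, and the data are invented for illustration only.

# Illustrative sketch with simulated module scores; the thesis's analysis used
# flexmix, but mclust's "VVI" models also give a diagonal-covariance Gaussian mixture.
library(mclust)

set.seed(1)
scores <- data.frame(
  ESR1  = c(rnorm(115, -1.0, 0.4), rnorm(530, 1.0, 0.5), rnorm(219, 0.3, 0.5)),
  ERBB2 = c(rnorm(115,  0.0, 0.4), rnorm(530, 0.0, 0.4), rnorm(219, 1.5, 0.5))
)

fit <- Mclust(scores, G = 3, modelNames = "VVI")  # 3 components, diagonal covariances
fit$parameters$mean        # component means for ESR1 and ERBB2
table(fit$classification)  # cluster sizes; clusters are labelled ER-/HER2-, ER+/HER2-,
                           # HER2+ by inspecting the fitted ESR1 and ERBB2 means above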


Figure 4.10.: Expression of ESR1 (ER) versus ERBB2 (HER2) for the combined dataset. A mixture of three Gaussians is fitted to the data. Clusters 1, 2, and 3 represent the ER−/HER2−, ER+/HER2−, and HER2+ subtypes, respectively.


These results show that the gene sets identified as strongly associated with metastasis can successfully reproduce previously known biological mechanisms of breast cancer metastasis. Additionally, the sets are diverse enough to capture different cellular mechanisms (cell cycle versus immune response) when applied to data with different cancer subtypes. Supplementary Figure A.8 shows the expression of several genes related to breast cancer prognosis and molecular subtype, and the correspondence with the ER/HER2 subtype classification. Apart from the unsurprising association of ESR1 (ER) and ERBB2 (HER2) with the subtypes, GATA3 and FOXA1 especially stand out as genes whose expression is associated with the ER/HER2 subtype, being under-expressed in the ER−/HER2− subtype. Both GATA3 and FOXA1 are known breast cancer prognosis indicators (Albergaria et al., 2009), and suppression of GATA3 has been linked to loss of regulation of tumour differentiation (Dydensborg et al., 2009; Kouros-Mehr et al., 2008).

We also investigated how the prognostic sets derived from different set statistics overlapped with the genes in the modules defined by Desmedt et al. (2008). Figure A.7 shows Kolmogorov-Smirnov plots for each set statistic in each ER/HER2 subtype separately. The sets were sorted in increasing order of centroid classifier weight w. For each subtype, the line in the plots moves up one step of size 1/m (where m is the number of sets containing at least one of Desmedt's genes) when the set contains at least one gene belonging to a module defined by Desmedt et al. (2008), and down (with step size 1/n, where n is the number of sets containing none of these genes) when it does not. Each molecular subgroup uses a different set of Desmedt's module genes: HER2+ is compared against modules PLAU and STAT1, ER−/HER2− against module STAT1, and ER+/HER2− against module AURKA. In contrast to the other set statistics, the set PC, the set t-statistic, and to some extent the set U-statistic exhibit more pronounced enrichment of Desmedt's module genes at the top and bottom of the sorted set list, indicating that the set PC and the set t-statistic are the most concordant with the genes defined by the Desmedt modules. Therefore, the set PC seems to be the best set statistic in terms of reproducing previous module definitions, but it does not perform as well as the other set statistics in terms of predicting metastasis (see Section 4.2.2). The set t-statistic is perhaps a better compromise in terms of predictive ability and agreement with Desmedt's modules.

4.3.6. Do the Gene Sets Point to the Same Biology as the Genes?

We next investigated whether the top gene sets reflect the same underlying biology as the top genes. In the combined data, we trained three types of classifiers: l2-penalised logistic regression (R package penalized (Goeman, 2008)), SVM with linear kernel (R package kernlab (Karatzoglou et al., 2004)), and the centroid classifier. Each classifier was trained on the genes and on the gene set statistics (set centroids), for a total of six models. For each model, we ranked the features by the absolute value of their weights.


Class | # | MSigDB Set | Cat. | Description | Sign
ER−/HER2− | 1 | chr7q12 | C1 | Genes in cytogenetic band chr7q12 | +1
ER−/HER2− | 2 | COLLER MYC DN | C2 | Genes down-regulated by MYC in 293T (transformed fetal renal cell) | −1
ER−/HER2− | 3 | IFNGPATHWAY | C2 | IFN gamma signaling pathway | +1
ER−/HER2− | 4 | GRANDVAUX IFN NOT IRF3 UP | C2 | Genes up-regulated by interferon alpha/beta but not by IRF3 in Jurkat (T cell) | +1
ER−/HER2− | 5 | GNF2 ST13 | C4 | Neighborhood of ST13 | −1
ER−/HER2− | 6 | GNF2 CD48 | C4 | Neighborhood of CD48 | +1
ER−/HER2− | 7 | GNF2 GLTSCR2 | C4 | Neighborhood of GLTSCR2 | −1
ER−/HER2− | 8 | MENSE HYPOXIA DN | C2 | List of hypoxia-suppressed genes found in both astrocytes and HeLa cells | −1
ER−/HER2− | 9 | HSA03010 RIBOSOME | C2 | Genes involved in ribosome | −1
ER−/HER2− | 10 | GCM TPT1 | C4 | Neighborhood of TPT1 | −1
ER+/HER2− | 1 | GNF2 MKI67 | C4 | Neighborhood of MKI67 | −1
ER+/HER2− | 2 | GNF2 TTK | C4 | Neighborhood of TTK | −1
ER+/HER2− | 3 | GNF2 HMMR | C4 | Neighborhood of HMMR | −1
ER+/HER2− | 4 | GNF2 CCNA2 | C4 | Neighborhood of CCNA2 | −1
ER+/HER2− | 5 | GNF2 SMC2L1 | C4 | Neighborhood of SMC2L1 | −1
ER+/HER2− | 6 | GNF2 ESPL1 | C4 | Neighborhood of ESPL1 | −1
ER+/HER2− | 7 | GNF2 CDC20 | C4 | Neighborhood of CDC20 | −1
ER+/HER2− | 8 | GNF2 H2AFX | C4 | Neighborhood of H2AFX | −1
ER+/HER2− | 9 | GNF2 RRM2 | C4 | Neighborhood of RRM2 | −1
ER+/HER2− | 10 | ZHAN MM CD138 PR VS REST | C2 | 50 top-ranked SAM-defined overexpressed genes in each subgroup (PR) | −1
HER2+ | 1 | chr4p | C1 | Genes in cytogenetic band chr4p | −1
HER2+ | 2 | chr1q11 | C1 | Genes in cytogenetic band chr1q11 | +1
HER2+ | 3 | DAC FIBRO DN | C2 | Downregulated by DAC treatment in LD419 fibroblast cells | −1
HER2+ | 4 | GNF2 MKI67 | C4 | Neighborhood of MKI67 | −1
HER2+ | 5 | GNF2 CCNA2 | C4 | Neighborhood of CCNA2 | −1
HER2+ | 6 | GNF2 TTK | C4 | Neighborhood of TTK | −1
HER2+ | 7 | GNF2 H2AFX | C4 | Neighborhood of H2AFX | −1
HER2+ | 8 | GNF2 HMMR | C4 | Neighborhood of HMMR | −1
HER2+ | 9 | CROONQUIST IL6 RAS DN | C2 | Genes downregulated in multiple myeloma cells exposed to the pro-proliferative cytokine IL-6 versus those with N-ras-activating mutations | −1
HER2+ | 10 | CROONQUIST IL6 STARVE UP | C2 | Genes upregulated in multiple myeloma cells exposed to the pro-proliferative cytokine IL-6 versus those that were IL-6-starved | −1

Table 4.5.: Top 10 MSigDB sets for ER/HER2 molecular subtypes, chosen by the centroid classifier using the set centroid statistic. Sign = −1 if expression is negatively associated with long-term survival, and +1 for positive association with long-term survival.


Classifier | # | MSigDB set | p-value | Matches | Set size
CC | 1 | GNF2 MKI67 | < 1.00×10^−40 | 31 | 47
CC | 2 | GNF2 TTK | < 1.00×10^−40 | 29 | 57
CC | 3 | GNF2 CCNA2 | < 1.00×10^−40 | 48 | 99
CC | 4 | GNF2 HMMR | < 1.00×10^−40 | 42 | 78
CC | 5 | GNF2 SMC2L1 | < 1.00×10^−40 | 26 | 51
CC | 6 | GNF2 CDC20 | < 1.00×10^−40 | 46 | 91
CC | 7 | GNF2 ESPL1 | < 1.00×10^−40 | 27 | 58
CC | 8 | GNF2 H2AFX | < 1.00×10^−40 | 24 | 54
CC | 9 | GNF2 RRM2 | < 1.00×10^−40 | 32 | 68
CC | 10 | chr1q11 | 2.32×10^−6 | 2 | 4
SVM | 1 | chr7q12 | 6.23×10^−4 | 1 | 1
SVM | 2 | chr3q11 | 1.00 | 0 | 8
SVM | 3 | chrxq | 1.00 | 0 | 2
SVM | 4 | BYSTRYKH RUNX1 TARGETS GLOCUS | 8.06×10^−3 | 1 | 13
SVM | 5 | TESTIS EXPRESSED GENES | 7.28×10^−7 | 4 | 107
SVM | 6 | chr22q | 1.00 | 0 | 6
SVM | 7 | REGULATION OF G PROTEIN COUPLED RECEPTOR PROTEIN SIGNALING PATHWAY | 4.28×10^−4 | 2 | 48
SVM | 8 | chr11p14 | 1.00 | 0 | 20
SVM | 9 | TERCPATHWAY | 1.00 | 0 | 15
SVM | 10 | chr1q41 | 2.02×10^−4 | 2 | 33
LR | 1 | chr3q11 | 1.00 | 0 | 8
LR | 2 | chr22q | 1.00 | 0 | 6
LR | 3 | TERCPATHWAY | 1.00 | 0 | 15
LR | 4 | chrxq | 1.00 | 0 | 2
LR | 5 | BYSTRYKH RUNX1 TARGETS GLOCUS | 8.06×10^−3 | 1 | 13
LR | 6 | HSA00130 UBIQUINONE BIOSYNTHESIS | 1.00 | 0 | 8
LR | 7 | chr20p | 1.00 | 0 | 2
LR | 8 | chr1q41 | 1.29×10^−6 | 3 | 33
LR | 9 | chr3q12 | 1.00 | 0 | 23
LR | 10 | BETA TUBULIN BINDING | 1.00 | 0 | 12

Table 4.6.: Top 10 sets using the set centroid statistic with different classifiers, and the p-value for the size of the intersection between the top individual genes and the top gene sets (Fisher's exact test, one-sided). CC is the centroid classifier, LR is logistic regression.


We then selected the top 512 genes, a number high enough to produce a high AUC (beyond which the AUC does not increase much) and much higher than the number of genes in many published metastatic signatures. Other cutoffs (256, 1024, 2048) exhibited similar results (not shown). For each of the top-ranked sets, we then checked how many of the top-ranked genes belonged to that set, using the same classifier (that is, centroid genes to centroid sets, logistic regression genes to logistic regression sets, SVM genes to SVM sets). The number of individual genes that mapped to each set was quantified using a one-sided Fisher exact test, in order to check whether the size of the intersection between the top sets and the top individual genes was significantly larger than expected by chance.

As shown in Table 4.6, there is significant overlap between the top sets and the top genes found by the centroid classifier. In comparison, both logistic regression and the SVM show very little overlap. In other words, the top genes selected by the centroid classifier are over-represented in the top sets ranked by the centroid classifier using the set centroid statistic, indicating the same underlying biological processes associated with metastasis. This does not seem to be the case for the other classifiers — the top sets found by them tend not to contain the top genes identified when considering individual genes. While this phenomenon is not necessarily to be taken as a shortcoming of models such as logistic regression, it serves as further confirmation that, for the centroid classifier at least, the biology identified by the gene sets seems to be similar to that identified by the individual genes. Further work is needed to evaluate the underlying reasons for the differences between the models. However, for the centroid classifier, the top sets are consistent with signatures for cancer progression, as discussed in Section 4.3.4.

4.4. Summary

While the understanding of breast cancer etiology and prognosis has progressed using gene expression data, one of the main challenges has been robust identification of which genes are highly associated with metastasis, with different studies producing gene lists with little or no overlap, raising doubts about the biological interpretation of these genes and the robustness of the results.

We have shown that classifiers based on sets of genes, rather than individual genes, have similar predictive power but are more stable and more reproducible, both within datasets and between datasets, and as a result may facilitate increased understanding of the biological mechanisms relating to breast cancer prognosis. The likely explanation is that the expression of any given gene is a function of both its contextual regulation — regulation under varying conditions, both observed and unobserved (such as noisy transcription) — and inherent variability due to germ-line variations and differences in host-tumour response between individuals (Morley et al., 2004). The former variability can be used for prognostic purposes.


However, the latter reduces the prognostic accuracy, since patient-level variability is typically not considered when building prognostic models. The lack of predictive improvement from using gene sets has recently been observed by Staiger et al. (2011), with possible reasons including high levels of noise in the data, the simplicity of the set assignment methods (pathway information is usually neither detailed nor directional), and the crudeness of the set statistics in extracting meaningful signal from the sets' expression. Furthermore, Staiger et al. (2011) and Venet et al. (2011) have shown that sets composed of randomly chosen genes perform as well as those based on MSigDB, at least in regards to predictive ability. However, as Venet et al. (2011) demonstrate, more than 50% of the genes assayed in most human microarray experiments show some correlation with the cell-cycle process, which is known to be associated with the outcome. Therefore, even gene sets composed of randomly selected genes are likely to contain some genes associated with the cell cycle and therefore indirectly with the outcome. The chance of including these predictive genes should increase as the set size increases. In contrast, we did not find evidence for an association between the set size and the predictive ability of the set. Clearly, future work should include investigating the top predictive gene sets and dissecting their constituent genes to better understand which genes in a set confer the predictive ability and which are potentially superfluous.

We have found that the C4 computationally-derived sets tended to produce better classifiers of metastasis than sets from the other MSigDB categories. This difference may be due to the fact that C4 sets are sets of coexpressed genes based on datasets with a large number of cancer samples, and were designed to be associated with the cancer phenotype. These results suggest that there is more prognostic value in large-scale systematic efforts to compile lists of sets of coexpressed genes from large datasets (Brentani et al., 2003; Segal et al., 2003), as opposed to approaches that build sets from limited pathway and GO knowledge. In order to be useful for phenotypes other than breast cancer metastasis, and to potentially increase statistical power, these datasets should cover a wide range of diseases and phenotypes.

Importantly, our results are in agreement with the current understanding of the main drivers of breast cancer metastasis, namely proliferation for ER+/HER2−, immune response for ER−/HER2−, and tumour invasion and immune response for HER2+ (Desmedt et al., 2008), suggesting that the stability advantages afforded by gene sets do not come at the cost of biological interpretation. Apart from patient prognosis, there is also potential to apply the same approach to other phenotypes, such as understanding the biological mechanisms responsible for response and resistance to anti-cancer therapies (Li et al., 2011).

We have used simple set statistics to represent gene set activity. These statistics are computationally tractable and depend on predefined set memberships. Some set statistics are not always sensible; for example, the average expression of a gene set may not be meaningful when the genes are negatively correlated or uncorrelated; different statistics may be optimal for different gene sets.


Moreover, these statistics ignore the structure and temporal dynamics of the gene networks, which could be important in deciphering causal relationships between genes and phenotypes. However, reliable information about the detailed structure of human gene networks is currently limited, relative to pathways in simpler organisms such as yeast, which have been well characterised.

With respect to predictive ability for metastasis, we and others (Chuang et al., 2007; Kim and Kim, 2008; Staiger et al., 2011; van 't Veer et al., 2002) have applied a wide range of machine learning methods to breast cancer data, including those based on individual genes and gene sets, and have found similar results. While future improvement from examining even larger datasets cannot be ruled out, there is strong evidence to suggest that the upper limit has been reached. Predictive ability is hindered by variance, resulting from several factors. The first factor is microarray measurement noise. Noise may be mitigated by technical replicates (several microarrays per patient sample), larger sample sizes, and, ultimately, replacement of gene expression microarrays with other technologies such as RNA-Seq (Shendure, 2008) when those are mature and cheap enough. A second factor is variability due to unmeasured (latent) factors that vary between patients and within patients across time — gene expression microarrays do not take into account many other biochemical factors in the cell, such as proteins and other metabolites, and epigenetic effects. The molecular classification of breast cancer may be further refined in the future, based on these new forms of genomic data. If successfully combined, these additional forms of information may increase predictive ability. A third factor limiting predictive ability is the inherent stochasticity of the metastatic process, as with all biological processes — it may be that even given "perfect" information about the biological status of each patient, we may not be able to accurately predict their metastatic outcome several years into the future, as it may essentially be a highly random event.

Future Work

There are other ways in which these data could have been analysed. First, survival models such as Cox proportional-hazards models could be used to more fully take into account the time-to-metastasis data, rather than arbitrarily discretising the outcomes into a binary variable. Second, other approaches take the gene set information into account without having to aggregate features into gene set statistics. One approach could be based on group lasso models (Jacob et al., 2009; Meier et al., 2008), where a lasso penalty is applied to the l2-norm of each gene set. Such a penalty encourages selection of genes in a set, such that if one gene is selected to be in the model then other genes in the set can enter the model as well. The group lasso approach is more flexible than the set statistic approach since it still operates at the individual gene level, rather than operating on sets. Similarly to the set statistics, the group lasso still requires a definition of which genes belong to which sets. Another alternative approach could be a hierarchical model, either frequentist mixed-effects models or Bayesian hierarchical models (Gelman and Hill, 2007).


In both types of hierarchical model, genes can be analysed individually, but in addition set effects (such as each set having its own average expression level) and other effects of interest, such as molecular subtype classification and age, can be accounted for. Again, such models potentially offer more flexible modelling of the data than set statistics. The downside of such models is that they are much more computationally demanding, which is an important consideration when analysing genomic datasets that commonly have upwards of thousands of features. Third, instead of using fixed predefined sets, de novo set discovery could be applied to these data, where the sets are defined from the data on the basis of coexpression, using methods such as hierarchical clustering, Gaussian mixture models, or latent Dirichlet allocation (LDA) (Blei et al., 2003) and related approaches (Liu et al., 2010; Savage et al., 2010). De novo discovery would likely need a large number of samples in order to produce stable and reproducible results. Such an approach would be easier now that many gene expression datasets are publicly available through repositories such as NCBI GEO (http://www.ncbi.nlm.nih.gov/geo) and ArrayExpress (http://www.ebi.ac.uk/arrayexpress), as long as the issues of suitable dataset normalisation and integration across different platforms are taken into account. Fourth, univariate summaries of gene expression levels in a set, such as the first principal component, may be incapable of capturing a substantial amount of the variation in the set, requiring multiple principal components. In such a case, a multivariate model or a canonical correlation analysis (CCA) approach may prove more useful. The definition of breast cancer subtypes may change as well, as larger and more comprehensive datasets are accumulated, leading to more stable and subtle disease classes, as demonstrated recently by Curtis et al. (2012).


5. Fast and Memory-Efficient Sparse Linear Models of Large Genome-Wide Datasets

5.1. Introduction

One of the challenges raised by recent advances in the genomics of complex phenotypes is the prediction of phenotype given genotype, such as prediction of disease from SNP data. Successful identification of SNPs strongly predictive of disease promises a better understanding of the biological mechanisms underlying the disease, and has the potential to lead to early disease diagnosis and preventative strategies. The question of predictive ability is also closely related to the proportion of phenotypic and genetic variance that can be explained by common SNPs, and to the lively debate surrounding the "missing heritability" of many complex diseases (Manolio et al., 2009). To quantify the genetic effect, we must fit a statistical model to all SNPs simultaneously. Lasso-penalised models (Tibshirani, 1996) are well suited to this task, since they perform variable selection — some model weights are exactly zero and thus excluded from the model. In this way, lasso models remove the need for screening SNPs based on univariable statistics prior to fitting a multivariable model of the phenotype (Wu et al., 2009).

However, fitting models to genome-wide or whole-genome data is challenging, since such studies typically assay thousands to tens of thousands of samples and hundreds of thousands to millions of SNPs. With standard analysis tools, modelling genome-wide and whole-genome data is either impossible or extremely inefficient. For example, most existing analysis tools require loading the entire dataset into memory prior to fitting the models, which is both time-consuming and requires large amounts of memory to store the data and fit the models. In order to perform simultaneous modelling of SNP variation across the genome and build predictive models of disease and phenotype, it is clear that there is a need for new tools that are fast, not memory intensive, and easy to use.

Here, we present the tool SparSNP, which is an efficient implementation of lasso-penalised linear models. SparSNP can fit lasso models to large-scale genomic datasets in minutes using small amounts of memory, outperforming equivalent in-memory methods. Thus, SparSNP makes it practical to analyse massive datasets without the use of specialised computing hardware or cloud computing. SparSNP produces cross-validated model weights that can be used to select the top predictive SNPs. SparSNP also allows the resulting models to be evaluated for predictive power, and for phenotypic and genetic variance explained.

5.2. Background

SparSNP is an efficient implementation of l1-penalised loss minimisation with linear and squared-hinge loss functions, which we now discuss in more detail.

5.2.1. Penalised Loss

As introduced in Chapter 3, statistical models are fit by minimising a suitable loss function, such as linear loss for linear regression, logistic loss for logistic regression, and hinge loss for classification. Any of these loss functions can be penalised with an l1 (lasso) penalty and minimised to find the solutions (β_0^*, β^*) as follows:

\[ (\beta_0^*, \beta^*) = \operatorname*{arg\,min}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} L'(\beta_0, \beta) = L(\beta_0, \beta) + \lambda \sum_{j=1}^{p} |\beta_j|. \tag{5.1} \]

The penalty λ ≥ 0 is user-specified and controls the degree of penalisation. A high l1 penalty encourages sparse solutions (many β̂_j exactly zero for high enough λ). In contrast, another common penalty, the l2 penalty, defined as ||β||_2^2 = ∑_{j=1}^{p} β_j^2, induces proportional shrinkage of the estimates but generally does not induce sparse solutions (Hastie et al., 2009a). Note that the intercept term β_0 is not penalised, to prevent the estimation from depending on the chosen origin of the response y (Hastie et al., 2009a). In practical terms, the lasso penalty combines model fitting with variable selection, whereas the ridge penalty does not perform variable (feature) selection, requiring additional steps to select variables, such as discarding variables with low absolute weight.
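This lasso/ridge contrast is easy to see with an in-memory solver on simulated data; the sketch below uses the glmnet R package (which is not the SparSNP implementation described in this chapter) to fit both penalties and report how many coefficients are exactly zero along the regularisation path.

# Illustrative sketch on simulated data: l1 (lasso) versus l2 (ridge) penalised
# linear regression. glmnet is an in-memory solver; SparSNP (below) addresses
# the case where the data do not fit in memory.
library(glmnet)

set.seed(1)
N <- 200; p <- 1000
X <- matrix(rnorm(N * p), N, p)
beta <- c(rep(1, 5), rep(0, p - 5))   # only the first 5 variables have an effect
y <- as.vector(X %*% beta + rnorm(N))

lasso <- glmnet(X, y, alpha = 1)      # alpha = 1: pure l1 penalty
ridge <- glmnet(X, y, alpha = 0)      # alpha = 0: pure l2 penalty

lasso$df                              # non-zero weights at each lambda: sparse
ridge$df                              # ridge keeps essentially all p weights non-zero

In practice, λ would be chosen by cross-validation (for example with cv.glmnet), which parallels the cross-validated model weights produced by SparSNP.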


The lasso can also be formulated in a constrained form (Tibshirani, 1996),

\[ (\beta_0^*, \beta^*) = \operatorname*{arg\,min}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} L(\beta_0, \beta) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s, \tag{5.2} \]

where s ≥ 0. The Lagrange form (Eqn. 5.1) and the constrained form (Eqn. 5.2) are equivalent in the sense that for each s there is a penalty λ that yields the same solutions (Hastie et al., 2009a).

Both the l1 and the l2 penalties are useful in genome-wide analysis. First, in most studies N ≪ p, where N is the number of samples and p is the number of SNPs; therefore the standard linear and logistic solutions are not mathematically well-defined unless penalised in some way. Second, the l1 penalty is useful since we expect only a small fraction of all SNPs to be truly causal, rather than spuriously associated with the phenotype through linkage disequilibrium with the causal SNP. The penalty allows the exact degree of sparsity to be tuned, so different models with different degrees of sparsity can be explored. Third, the l1 penalty shrinks the estimated coefficients towards zero, thereby reducing the tendency of the models to overfit — reducing the model's generalisation error (expected error on previously unseen datasets). When inputs are highly correlated, the l2 penalty tends to assign weights with similar magnitude but opposite signs to correlated inputs, whereas the l1 penalty tends to induce models that select one variable out of a group of correlated variables. These differences should be taken into account when interpreting the selected variables, especially for SNPs, as lack of inclusion in the l1-penalised model does not necessarily imply lack of association with the phenotype — the excluded SNP may have been "masked" by a highly correlated SNP that is in the model.

5.2.2. Review of Methods for Fitting Lasso Models

Given a convex loss function, such as the linear regression, logistic regression, and squared-hinge loss functions, the l1-penalised loss (Eqn. 5.1) is convex as well (with respect to the weights β). Convex optimisation is a well-understood problem, for which many tools are available (Boyd and Vandenberghe, 2004; Nocedal and Wright, 2006), each with their own strengths and weaknesses. We briefly review some of the major classes of methods for fitting l1-penalised models; see Bach et al. (2011) for a detailed discussion of these approaches and others.

• LAR and Homotopy. In the LAR (Least Angle Regression) algorithm (Efron et al., 2004), assuming standardised (zero mean and unit variance) inputs and the residual vector r_i = y_i − x_i^T β, the weight β_j for the variable x_j = (x_{1j}, ..., x_{Nj})^T most correlated with r = (r_1, ..., r_N)^T is increased towards its unpenalised least-squares weight x_j^T r until another variable x_k has at least equal correlation. Then x_k is entered into the model and the residuals are recomputed. This process is repeated until all variables are in the model, achieving the unpenalised least-squares solution. Modifying LAR to exclude a non-zero variable that becomes zero, and then recomputing the residuals, yields the lasso solution, and the series of LAR solutions is then the lasso regularisation path (the piecewise-linear series of weights for each λ). The homotopy method (Osborne et al., 2000a,b) is similar to LAR (Hastie et al., 2009a).

• Coordinate descent. Also called Shooting (Fu, 1998), later rediscovered and expanded by Daubechies et al. (2004) and Friedman et al. (2007). In coordinate descent, the loss is optimised with respect to each variable separately, while holding the others fixed. This process is repeated over all variables cyclically (in which case it is equivalent to the Gauss-Seidel method), randomly (Shalev-Shwartz and Tewari, 2009), or in some other order (for example, by the magnitude of the gradients), until convergence. The l1 penalty is applied to the estimates using the soft-thresholding operation. Coordinate descent can be parallelised (Bradley et al., 2011). In addition, block coordinate descent can be used, where minimisation is performed with respect to blocks of variables rather than one variable at a time. We discuss cyclical coordinate descent in more detail in Section 5.4.1.

• Projected gradient. In projected gradient methods (Duchi et al., 2008), the loss function is minimised in the original constrained form (Eqn. 5.2) rather than the Lagrangian form. The loss minimisation is performed using gradient descent, and the constraints are imposed by projecting the weights onto the feasible region where the constraints are satisfied (the l1 ball).

• Stochastic gradient descent. Stochastic gradient descent (SGD) (Bottou and LeCun, 2004; Langford et al., 2009) is a first-order online algorithm, in which a gradient descent step β_(k+1) = β_k − η ∇L(β_k), for some small step size η, over all the input variables is taken after each sample is encountered; this is in contrast to the other methods, which are batch methods, where each update is based on all samples. The step size is usually decreased over time (step size decay). Sparsity can be obtained by truncating updates that cross zero (a simplified version of this idea is sketched after this list). SGD has the advantage of requiring very little memory, since only one sample needs to be accessed at any one time. SGD has been shown to achieve good generalisation error very quickly, since in practice an algorithm does not need to find the true global minimum of the loss function in order to have good out-of-sample performance. However, SGD requires careful tuning of the step size η and the step-size decay scheme to achieve good results, which may limit its widespread adoption within the genomics community, compared with other methods that require less tuning.
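To make the truncation idea concrete, the following R sketch (a deliberate simplification, not the truncated-gradient algorithm of Langford et al. (2009) and not part of SparSNP) performs one SGD epoch for the l1-penalised linear loss, zeroing any weight whose update crosses zero:

# Simplified illustration of SGD with zero-crossing truncation for the
# l1-penalised linear (least-squares) loss; eta and lambda are held fixed here,
# whereas in practice the step size would be decayed over time.
sgd_lasso_epoch <- function(X, y, beta, eta, lambda) {
  for (i in sample(nrow(X))) {                 # visit samples in random order
    r <- sum(X[i, ] * beta) - y[i]             # residual for sample i
    g <- r * X[i, ] + lambda * sign(beta)      # subgradient of the penalised loss
    beta_new <- beta - eta * g                 # gradient step
    crossed <- beta != 0 & sign(beta_new) != sign(beta)
    beta_new[crossed] <- 0                     # truncate updates that cross zero
    beta <- beta_new
  }
  beta
}

set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20)
y <- X[, 1] - X[, 2] + rnorm(100)
beta <- rep(0, 20)
for (epoch in 1:20) beta <- sgd_lasso_epoch(X, y, beta, eta = 0.01, lambda = 0.1)
sum(beta == 0)   # count of weights that are exactly zero after the final epoch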


5.3. Design Considerations

Having reviewed several methods for fitting lasso models, we now discuss considerations for designing a practical method for fitting lasso models in the context of large-scale genetic data.

Speed. Genetic datasets often contain hundreds of thousands of markers and thousands of samples. We would like to use methods that can process large amounts of data rapidly. Data are usually not analysed once: we typically perform cross-validation, examine different phenotypes (if available), fit models to subsets of the data, and so on. Therefore, it is not realistic to use a method that requires many hours or entire days to fit one model. When analysing large datasets, the bottleneck quickly becomes I/O (reading data from disk) rather than fitting the model. Therefore, not having to load all data into memory before fitting begins is an advantage. In addition, speed concerns dictate the implementation language: an interpreted language such as R or Python may be more convenient than a compiled language such as C, but interpreted languages are typically one or two orders of magnitude slower than C, with less fine-grained control over operations such as copying of data structures, which can be prohibitive for large data. The use of C also permits more compact data representations.

Scalability. We define scalability as the ability to analyse increasingly large datasets with constant or linearly-increasing resources. Scalability is not the same as speed. For example, any method that depends on computing covariance matrices, such as Newton's method for minimisation, is not naïvely scalable to SNP data, as storing these matrices, let alone performing operations on them, is beyond the abilities of current commodity hardware; for example, storing the triangular covariance matrix of 500,000 SNPs with one byte per entry would require around 116GiB of RAM (covariance matrices are typically not sparse, so sparse matrix representations are not useful here). Even for methods that do not utilise covariance matrices, such as coordinate descent and quasi-Newton methods (variants of Newton's method using approximations of the Hessian matrix), loading all data into memory may not be practical. Therefore, when analysing large datasets, we would like to avoid having to load all data into memory at once. In addition, tools that make as few copies of the data as possible are preferable, both for increased speed and for reduced memory usage.

Tuning. All l1-penalised methods require tuning of the λ penalty. However, some methods require tuning additional parameters to achieve good results. For example, stochastic gradient descent requires manual tuning of the step size and its decay rate. We would like to free the analyst from concerns about the numerical properties of the fitting algorithm, such as whether convergence has occurred or not, and let them concentrate on analysing the data.


While no method is completely fail-safe under all possible inputs, we and others (Friedman et al., 2010) have empirically found coordinate descent to be numerically stable when analysing SNP data, consistently converging over all useful penalty ranges, without any other tuning required.

5.4. Methods

Based on the design considerations above, we have designed an efficient implementation of cyclical coordinate descent for fitting l1-penalised linear models to SNP data, requiring memory that grows only linearly, O(N + p), and outperforming several state-of-the-art in-memory methods (and one out-of-core method) when accounting for the time taken to load the data into memory. We now describe in more detail our implementation of coordinate descent in SparSNP.

5.4.1. Out-of-Core Coordinate Descent

Minimising l1-penalised loss functions is a convex optimisation problem. However, it has, in general, no analytical solution, and must be solved numerically. We use a method based on coordinate descent to numerically minimise the loss function. By expressing the contribution of each variable to each sample in terms of a sum over the p variables (the linear predictor), we can use memory of order O(N + p), keeping only one input variable in working memory at a time by reading data from disk and updating the estimates at the end of each epoch. Pseudocode for the algorithm is shown in Algorithm 1.

In coordinate descent (Friedman et al., 2007, 2010; Van der Kooij, 2007), each variable is optimised with respect to the loss function using a univariable Newton step, while holding the other variables fixed. Since the updates are univariable, computation of the first and second derivatives is fast and simple (we assume that all of our loss functions are twice-differentiable, at least piecewise). The l1/l2 penalisation is achieved using soft thresholding (Friedman et al., 2007) of the Newton step for each variable β_j,

\hat{\beta}_j \leftarrow S(\hat{\beta}_j - s_j, \lambda),    (5.3)

where s_j = \frac{\partial L}{\partial \beta_j} / \frac{\partial^2 L}{\partial \beta_j^2} is the Newton step and S(⋅, ⋅) is the soft-thresholding operator

S(\alpha, \gamma) = \operatorname{sign}(\alpha) \max\{0, |\alpha| - \gamma\}, \quad \gamma \ge 0.

For the linear loss and the squared hinge loss, one Newton step yields the exact solution with respect to each β_j (though it is not necessarily the same as the global solution, hence the need for the active-set convergence method, see below). Other losses, such as the logistic loss, can be handled by a quadratic approximation (second-order Taylor expansion), in which case iteration over each β_j may be required until convergence.
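As a concrete illustration of Eqn. 5.3, the following minimal Python sketch (illustrative only, not the SparSNP C code; all names are hypothetical) implements the soft-thresholding operator and the thresholded Newton update for one coordinate. Here grad and hess stand for the loss derivatives ∂L/∂β_j and ∂²L/∂β_j², whose forms for the losses used here are given below.

def soft_threshold(alpha, gamma):
    # S(alpha, gamma) = sign(alpha) * max(0, |alpha| - gamma), gamma >= 0
    sign = 1.0 if alpha >= 0 else -1.0
    return sign * max(0.0, abs(alpha) - gamma)

def coordinate_update(beta_j, grad, hess, lam):
    # Thresholded univariable Newton step (Eqn. 5.3); the intercept beta_0
    # would be updated with the same step but without the penalty.
    s_j = grad / hess
    return soft_threshold(beta_j - s_j, lam)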


There are two key aspects of this approach that allow efficient computation without keeping all data in working memory. First, since we are performing univariable minimisation, both the first and second derivatives are scalars and are computed in a single pass over the samples. Second, the partial derivative with respect to β̂_j is computed efficiently since it is based on the linear predictor l, which is the sum of the contributions of all variables to the model, l_i = β̂_0 + \sum_{j=1}^{p} x_{ij} β̂_j, i = 1, ..., N. The linear predictor changes only for one variable at a time, and only if that variable changes its value. Once the estimate for the jth variable has been updated, the linear predictor is updated by subtracting the old contribution x_{ij} β̂_j and adding the new contribution x_{ij} β̂'_j. The linear predictor form can be used whenever a linear or log-linear statistical model is used, since the predictor is then additive in each variable's contribution. By only storing the linear predictor and one vector of samples x_{1j}, ..., x_{Nj} in memory at any given time, we can keep memory requirements to a minimum, allowing us to fit models to datasets far larger than available RAM.

The coordinate descent algorithm is identical across all loss functions; the only difference is the computation of the Newton step and the update to the linear predictor. For the linear loss, the first derivative with respect to the weight β_j is

\frac{\partial L}{\partial \beta_j} = \sum_{i=1}^{N} x_{ij} (l_i - y_i),    (5.4)

and the second derivative is

\frac{\partial^2 L}{\partial \beta_j^2} = \sum_{i=1}^{N} x_{ij}^2.    (5.5)

For the squared hinge loss, the first derivative is

\frac{\partial L}{\partial \beta_j} = \sum_{i=1}^{N} y_i x_{ij} (y_i l_i - 1)\, I(1 - y_i l_i > 0),    (5.6)

and the second derivative is

\frac{\partial^2 L}{\partial \beta_j^2} = \sum_{i=1}^{N} x_{ij}^2\, I(1 - y_i l_i > 0),    (5.7)

where I(⋅) is the indicator function, which evaluates to one if its argument is true and to zero otherwise. Monomorphic (zero variance) SNPs are assigned zero weight since their first and second derivatives are both zero. When the input data are standardised such that each SNP has zero mean and unit variance,

x'_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j},

where \bar{x}_j and σ_j are the arithmetic mean and standard deviation of the jth genotype, respectively, then the second derivative satisfies ∂²L/∂β_j² ≤ N − 1 for the linear and squared hinge losses (due to the I(⋅) term), with strict equality for the linear loss. Therefore, the Newton step can be computed as s_j = (∂L/∂β_j)/(N − 1), without explicitly computing the second derivative.
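A hedged NumPy sketch of the per-SNP derivative computations (Eqns. 5.4-5.7) and the incremental linear-predictor update described above; variable names are illustrative, and the phenotype y is assumed coded as ±1 for the squared hinge loss.

import numpy as np

def derivatives_linear(x_j, y, lp):
    # Eqns. 5.4 and 5.5: squared-error (linear) loss
    grad = np.sum(x_j * (lp - y))
    hess = np.sum(x_j ** 2)
    return grad, hess

def derivatives_sqrhinge(x_j, y, lp):
    # Eqns. 5.6 and 5.7: squared hinge loss, y in {-1, +1}
    active = (1.0 - y * lp) > 0                    # I(1 - y_i * l_i > 0)
    grad = np.sum(y * x_j * (y * lp - 1.0) * active)
    hess = np.sum((x_j ** 2) * active)
    return grad, hess

def update_linear_predictor(lp, x_j, old_beta, new_beta):
    # Only the jth variable's contribution changes: subtract the old
    # contribution and add the new one, in one pass over the N samples.
    return lp + (new_beta - old_beta) * x_j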


5.4.2. Computational Enhancements

SparSNP employs several enhancements to the basic coordinate descent method that greatly improve performance without affecting the model fit. For large datasets, the main bottleneck is I/O (reading the data into memory), not fitting the model itself. Therefore, the most significant speed improvements come from reducing the number of times SNPs are loaded from disk and reducing the time taken to get the data into a usable form that can be fed into the model-fitting procedure.

Active-set convergence. The active-set method (Friedman et al., 2007) is designed to take advantage of the sparsity of the weight vector β, as is commonly the case in analyses of SNP data, where only a small fraction of the SNPs are expected to have non-zero weights. The method has two main stages. First, we iterate over all variables, one at a time. If any variable j becomes zero (inactive) due to the soft-thresholding, it is excluded from the next iteration. We then iterate over the remaining active variables. When the loss converges (Section 5.4.3), we check whether the active set has changed. If the active set does not change in two such consecutive iterations, the algorithm terminates. Otherwise, all variables are added back to the active set and iterated over as before.

Warm restarts. We use a warm-restart strategy (Friedman et al., 2010) whereby we run coordinate descent over a grid of penalties λ_max, ..., λ_min. We define the maximal penalty λ_max as the smallest λ that makes all β̂_j = 0; it is computed by first computing the unpenalised intercept, and then evaluating the Newton step (Eqn. 5.3) for each variable j. Each weight β̂_j is initialised to zero. Due to soft thresholding (Eqn. 5.3), each β̂_j will remain zero if its step satisfies |s_j| ≤ λ. Therefore, the maximal λ is max_{j=1,...,p} |s_j|. The minimal penalty λ_min is taken to be a small fraction of the maximal λ, usually 10^{-2} λ_max. The process then proceeds along the grid: the results from the fit with penalty λ_k (including the vector of solutions β̂, the linear predictor l, and the active set) are used to initialise the algorithm for the fit with λ_{k+1}. This strategy reduces computation time considerably, since the (k + 1)th fit can typically start from a small active set, rather than from the entire set of variables.
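A small NumPy sketch of how such a penalty grid might be constructed for the linear loss, under the assumptions that the genotypes are already standardised (so the second derivative equals N − 1) and that the unpenalised intercept equals the phenotype mean; the function name, the grid size, and the ratio are illustrative only.

import numpy as np

def lambda_grid(X, y, n_penalties=20, ratio=1e-2):
    # X: N x p matrix of standardised genotypes, y: N-vector phenotype.
    lp = np.full(y.shape, y.mean())            # intercept-only linear predictor
    steps = X.T @ (lp - y) / (len(y) - 1)      # Newton steps s_j with all beta_j = 0
    lam_max = np.max(np.abs(steps))            # smallest lambda keeping all beta_j = 0
    lam_min = ratio * lam_max
    return np.geomspace(lam_max, lam_min, n_penalties)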


Caching. The active set is typically much smaller than the total number of SNPs, and tends to be accessed more often than the other variables. Therefore, it is useful to keep the active set in memory rather than repeatedly read it from disk, which is orders of magnitude slower: in addition to random access of disk being slower than random access of RAM, the data are byte-packed and must be unpacked before use. However, all SNPs need to be accessed at some stage during the active-set convergence process. Therefore, we employ a simple priority cache of predetermined size. If there is room in the cache, more SNPs are read in until it is full. Once full, we read SNPs into the cache, replacing previous SNPs only if the new SNP has been accessed more often in previous iterations (we keep an access counter for each SNP). This way, the active-set SNPs (and other often-accessed SNPs, if there is room) tend to stay in the cache whereas the other SNPs do not, accelerating the active-set method.

Since caching more SNPs reduces disk accesses, the performance of SparSNP depends critically on the amount of RAM allocated to the cache. Performance will increase with increasing cache size, up to the point where the entire dataset is in RAM.

Pre-scaling. Inputs to l1-penalised models are typically standardised such that each genotype has zero mean and unit variance, since each input variable may be on a different scale, in which case using the same penalty for all variables may not make sense. (In the context of SNP data, scaling the SNPs corresponds to giving more weight to rarer variants.)

Pre-scaling is simply a time-saving measure that standardises the SNPs as a preprocessing step: the coordinate descent method repeatedly iterates over the variables, and repeatedly scaling the same inputs is wasteful. Therefore, we scale the data as a pre-processing step and store the means and standard deviations in a file. These parameters are later loaded during the actual fitting stage, and the precomputed values for each SNP are fetched from a lookup table instead of the raw {0, 1, 2} genotypes. When the fitting is complete, we transform the model weights estimated on the standardised inputs back to their original scale,

\hat{\beta}_j = \hat{\beta}_j^* / \sigma_j,    (5.8)

and

\hat{\beta}_0 = \hat{\beta}_0^* - \sum_{j=1}^{p} \hat{\beta}_j^* \mu_j / \sigma_j,    (5.9)

where β̂*_0 and β̂*_j are the intercept and weights estimated on the standardised inputs, and μ_j and σ_j are the mean and standard deviation of the jth SNP.

Data representation. We represent the genotypes in minor-allele dosage form {0, 1, 2, NA}, where "NA" denotes missing genotypes. We use the same byte-packing scheme as PLINK, encoding 4 genotypes in one byte, greatly reducing space requirements compared with the 8 bytes per genotype that would be required for a double-precision floating-point representation (the default for most numerical software). Besides space savings, byte packing leads to faster I/O, which is the main bottleneck of our method. Note that for the fitting stage, data are used in their scaled form, which is in double-precision floating-point format, and must be unpacked prior to use. Fast unpacking is achieved using a pre-computed lookup table that maps bytes (interpreted as unsigned integers in the range [0, 255]) to groups of four genotypes at a time.
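The following Python sketch illustrates the lookup-table idea with a hypothetical 2-bit-per-genotype packing; the particular bit assignments used here are an assumption for illustration and are not claimed to match PLINK's actual BED encoding.

import numpy as np

# Hypothetical 2-bit codes: dosages 0, 1, 2, plus 3 for missing ("NA").
CODES = np.array([0, 1, 2, -1], dtype=np.int8)    # -1 stands in for NA here

# Pre-computed table: each byte value (0..255) maps to four genotype codes.
LOOKUP = np.zeros((256, 4), dtype=np.int8)
for b in range(256):
    for k in range(4):
        LOOKUP[b, k] = CODES[(b >> (2 * k)) & 0b11]

def unpack(packed_bytes, n_samples):
    # Unpack a byte string into n_samples genotype codes for one SNP,
    # dropping any padding in the final byte.
    geno = LOOKUP[np.frombuffer(packed_bytes, dtype=np.uint8)].ravel()
    return geno[:n_samples]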


Update-on-change. As mentioned earlier, the linear predictor l must be updated every time a weight β_j changes its value. This involves iterating over one N-vector (for the linear loss) or several N-vectors (pre-computed products of the l vector with the y vector, for the squared hinge loss). We perform such updates only when the jth variable is active (non-zero) and the estimate β̂_j has actually changed from the last iteration, saving unnecessary updates.

5.4.3. Convergence of the Algorithm

There are three types of convergence relevant to this algorithm. First is convergence of the Newton step for each variable j. Second is convergence of coordinate descent to the global minimum. Third is convergence of the active set.

Convergence of the Newton step. Newton's method is exact for the linear loss and the squared hinge loss, hence convergence of the Newton step is guaranteed and we do not check for it. For other losses, such as the logistic loss, Newton's method amounts to a quadratic approximation and convergence is not guaranteed; other techniques such as line search are sometimes used (these are not implemented in SparSNP).

Convergence of coordinate descent. Tseng (2001) showed that coordinate descent converges to the global minimum when two conditions are met: (i) the loss function being minimised is convex, as is the case with most common loss functions such as those used here, and (ii) the penalty is separable, that is, the penalty is a sum of functions, each a function of a single weight β_j. The l1 penalty is separable, and coordinate descent is therefore guaranteed to find the minimum when used with a convex loss function.

In practice, convergence is typically measured either by the absolute change in the loss between iterations, |L^{(k)} − L^{(k−1)}| ≤ ε, or by the relative change in loss, |L^{(k)} − L^{(k−1)}| / |L^{(k)}| ≤ ε. Convergence can also be measured in each weight β_j separately, namely |β_j^{(k)} − β_j^{(k−1)}| ≤ ε or |β_j^{(k)} − β_j^{(k−1)}| / |β_j^{(k)}| ≤ ε for absolute and relative convergence, respectively. SparSNP implements the test for absolute change in the loss, as we found it to offer a good trade-off between speed (number of iterations until convergence) and precision in the final estimated weights.

Convergence of the active set. SparSNP uses an active-set method to reduce computational cost, whereby we iterate over a small set of active (non-zero) variables instead of over all p input variables (see Section 5.4.2 for details). Convergence of the active set means that the active set has not changed: no variables have either entered or left the set between two iterations.


While this is similar to convergence in the weights as discussed previously, active-set convergence means that variables that were previously zero remain exactly zero and those that were non-zero remain non-zero, whereas convergence in the weights does not guarantee this property (unless the tolerance ε is zero). Therefore, we use a combination of convergence in the loss and convergence of the active set to determine convergence in SparSNP.

5.4.4. Model Selection

The λ penalty tunes the model complexity, and can be selected in several ways. The simplest way is to leave it fixed at some arbitrary value; however, this may result in suboptimal performance if the number of selected variables is too small or too large. A second way is to prespecify the number of non-zero SNPs required, and then search for the λ penalty that produces the required number of SNPs (Wu et al., 2009). A third way is to use cross-validation. Cross-validation may produce models with too many false positives (non-zero weights that should be zero) (Meinshausen and Bühlmann, 2006), and Meinshausen and Bühlmann (2010) advocate using resampling to overcome this problem. However, in practice, cross-validation works well for selecting the best model or set of models when the number of samples is large enough that there are enough samples in each training fold to reasonably estimate the model parameters, and when the class labels are not too imbalanced, as is the case with many case/control datasets. Since the estimate of AUC derived from choosing the best model in cross-validation may be upwardly biased, an unbiased estimate of predictive performance should be derived from an independent test set.

5.4.5. Space Complexity

At any given time, we need to store in memory the following data:

• the cache, representing k vectors of samples x_ij for i = 1, ..., N (where k is the cache size in terms of N-vectors);

• one vector of the linear predictor l_i for i = 1, ..., N;

• one vector of the coefficients β̂_j for j = 0, ..., p;

• for loss functions other than the squared loss, several auxiliary vectors representing transformations of the linear predictor, each of length N;

• two vectors representing whether each variable is/was active, of length p + 1.

In total, the memory requirements are O(N + p).


    active_j ← active′_j ← True, for j = 0, ..., p
    allconverged ← 0
    for k = 1, ..., kmax do
        for j = 0, ..., p do
            if active_j then
                read x_j from disk
                s_j ← (∂L/∂β_j) / (∂²L/∂β_j²)
                β̂_j^(k) ← S(β̂_j^(k−1) − s_j, λ) if j > 0, β̂_j^(k−1) − s_j otherwise
                ∆_j ← β̂_j^(k) − β̂_j^(k−1)
                l ← UpdateLP(l, x_j, ∆_j)
            end
            active_j ← (β̂_j^(k) ≠ 0)
        end
        // Check convergence in the loss L
        if |L^(k) − L^(k−1)| ≤ ε then
            allconverged ← allconverged + 1
        else
            allconverged ← 0
        end
        if allconverged = 1 then
            // Loss has converged once; record the current active set
            active′_j ← active_j for j = 0, ..., p
        else
            if AllEqual(active, active′) then
                // Active set has not changed between two consecutive convergences; terminate
                break
            else
                // Active set has changed since the last convergence;
                // reset the recorded active set and repeat for another epoch
                active′_j ← active_j for j = 0, ..., p
                allconverged ← 1
            end
        end
    end

Algorithm 1: The coordinate descent algorithm, showing the active-set method. The function AllEqual(⋅, ⋅) returns True when both input vectors are identical and False otherwise, All(⋅) returns True when all elements of the input vector evaluate to True, and I(⋅) is the indicator function. The variable β_0 is the intercept. kmax is the maximal number of epochs (a user-determined parameter). For the linear loss, UpdateLP(l, x_j, ∆_j) := l_i + ∆_j x_ij for i = 1, ..., N.


5.4.6. Discrimination and Explained Phenotypic Variance

We measure the discrimination of a classifier using the Area Under the ROC Curve (AUC, or AROC) (Hanley and McNeil, 1982), defined as

\widehat{\mathrm{AUC}} = \frac{1}{N_{+} N_{-}} \sum_{i=1}^{N_{+}} \sum_{j=1}^{N_{-}} \left[ I(\hat{y}_i > \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right],    (5.10)

where N_+ and N_- are the numbers of positive and negative labels (N_+ + N_- = N), ŷ_i is the prediction for the ith sample, and I(⋅) is the indicator function, I(x) = 1 when x is true and 0 otherwise. The sample AUC has a probabilistic interpretation as the (estimated) probability of ranking two randomly chosen samples in the correct order (i.e., a randomly chosen case ranked above a randomly chosen control), plus a correction for ties. AUC = 0.5 is equivalent to random ranking, whereas AUC = 1 and AUC = 0 correspond to perfect and perfectly-wrong ranking, respectively. Unlike the error rate (or, conversely, the accuracy), AUC does not depend on the class balance of the dataset, hence it can be meaningfully compared across different datasets.
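A brief NumPy sketch of the sample AUC in Eqn. 5.10 (an illustrative O(N_+ N_-) implementation; production code would typically use an equivalent rank-based formula instead):

import numpy as np

def sample_auc(scores, labels):
    # AUC per Eqn. 5.10; labels are 1 for cases and 0 for controls.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every case against every control, counting ties as half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))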


SparSNP estimates the proportion of phenotypic or genetic variance explained by a given genetic model. The phenotypic variance is the variation observed in the phenotype in the data, and it may be due to environmental as well as genetic factors, whereas the genetic variance is the variation due solely to genetic factors, also called the heritability of the phenotype, which is typically estimated from twin or pedigree studies where confounding environmental effects can be minimised. Hence, the explained phenotypic variance is the variance of the phenotype in the data explained by the model, and the explained genetic variance is the proportion of the heritability that can be explained by the model. In practical terms, the higher the proportion of explained phenotypic variance, the better the model is at explaining the data. However, if the explained genetic variance is low, then the model is likely not capturing all of the genetic variation that affects the phenotype. The details of the derivation are given in Wray et al. (2010); for convenience we repeat the main method here, following their notation.

The explained phenotypic variance h²_{L[x]} is on the liability scale, whereas AUC is on the 0-1 scale. On the liability scale, we assume that the underlying liability P (roughly interpreted as risk of disease) is distributed according to the standard normal distribution, P ∼ N(0, 1), and that the threshold T separates patients without the disease (liability P < T) from those with the disease (liability P > T). This model is also called the probit model (the Gaussian counterpart to the binomial logit model used in logistic regression). T is determined from the observed population prevalence K. Since the proportion of patients with liability P > T equals the prevalence, T = Φ^{-1}(1 − K), where Φ^{-1}(⋅) is the inverse of the standard normal cumulative distribution function (cdf).

The explained phenotypic variance h²_{L[x]} is estimated from the AUC and the prevalence K as

h^2_{L[x]} = 2Q^2 / \left[ (v - i)^2 + Q^2\, i (i - T) + v (v - T) \right],    (5.11)

where

i = \phi(T)/K, \quad v = -iK/(1 - K),

and Q = Φ^{-1}(ÂUC), where Φ^{-1}(⋅) is the inverse standard normal cdf and φ(⋅) is the standard normal density function (pdf).

The proportion of explained genetic variance ρ²_{GG} is then

\rho^2_{GG} = h^2_{L[x]} / h^2_L,    (5.12)

where h²_L is the (narrow-sense) heritability of the disease on the liability scale, which must have been estimated beforehand.
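A hedged Python sketch of this AUC-to-liability-scale conversion (Eqns. 5.11 and 5.12), using SciPy's standard normal distribution; it simply transcribes the formulas above, and the function name and arguments are illustrative.

from scipy.stats import norm

def explained_variance_liability(auc, prevalence, heritability=None):
    # Explained phenotypic variance on the liability scale (Eqn. 5.11) and,
    # if the liability-scale heritability h^2_L is supplied, the proportion
    # of genetic variance explained (Eqn. 5.12).
    K = prevalence
    T = norm.ppf(1 - K)            # liability threshold
    i = norm.pdf(T) / K            # i = phi(T)/K
    v = -i * K / (1 - K)           # v = -iK/(1 - K)
    Q = norm.ppf(auc)
    h2_lx = 2 * Q**2 / ((v - i)**2 + Q**2 * i * (i - T) + v * (v - T))
    if heritability is None:
        return h2_lx
    return h2_lx, h2_lx / heritability

# Example: explained_variance_liability(0.9, 0.01) for AUC 0.9 at 1% prevalence.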


5.4.7. Technical Limitations

Missing data. For convenience, SparSNP implements random imputation for missing genotypes, where missing genotypes are randomly replaced with a genotype {0, 1, 2} (with probability 1/3 each), repeatedly on each access. When the proportion of missingness is small and the genotypes are missing at random (for example, no differential missingness between cases and controls), such a simple approach does not substantially affect the predictive ability and does not introduce significant spurious associations. However, when missingness is high or differential between cases and controls, spurious associations can arise, and it is crucial either to use PLINK to filter SNPs and samples with high missingness, or alternatively to impute the missing data using a more sophisticated method such as BEAGLE (Browning and Browning, 2007), IMPUTE (Howie et al., 2009), or MACH (Li et al., 2010).

Confounding effects. SparSNP does not account for possible batch effects, which must be accounted for at the quality control stage. Nor does SparSNP currently account for confounders such as population stratification, admixture, or cryptic relatedness; EIGENSTRAT (Price et al., 2006) and PLINK can be used to detect these and to filter the data accordingly.

Applying models to new data. SparSNP produces text files containing the model weights for each SNP, and can be used in prediction mode to read these weights, together with another BED file, to produce predictions for other datasets. Model weights are with respect to the minor allele dosage in the training data, and the reference allele may be different in another dataset, possibly resulting in a reversal of the sign of the SNP effect. In addition, both the discovery and validation datasets must contain the same SNPs in the same ordering (marker names are not important). We recommend using PLINK to ensure that both the discovery and validation datasets contain the same SNPs and are encoded using the same reference alleles.

5.5. Results

To assess the performance of SparSNP and compare it with existing methods, we used a celiac disease case/control dataset (Dubois et al., 2010), consisting of N = 11,940 samples from five European populations (Italian, Finnish, two British, and Dutch), with p = 516,504 autosomal SNPs. The data processing and quality control were described in the original publication.

We used two different experimental setups to compare SparSNP with four other state-of-the-art methods: one setup for timing comparison and another for predictive comparison via cross-validation. For the timing comparison, we timed the process of fitting the model (over a grid of hyperparameters for glmnet and SparSNP, one hyperparameter for the rest). For the predictive comparison, we used cross-validation over a grid of hyperparameters.

We compared the following methods:

• SparSNP 0.87 [1], with l1-penalised squared hinge loss;

• glmnet 1.7 (Friedman et al., 2010) [2], with logistic loss (binomial family), running under R 2.12.2 (R Development Core Team, 2011);

• liblinear 1.8 (Fan et al., 2008) [3], with l1-regularised l2-loss support vector classification (model 5, equivalent to the l1-penalised squared hinge loss used by SparSNP);

• liblinear-cdblock (Yu et al., 2010) [4], with a (non-sparse) block l2 support vector machine (model 1);

• hyperlasso (Hoggart et al., 2008) [5], with the double exponential (DE) prior (equivalent to the lasso).

All models used the minor allele dosage {0, 1, 2} as input.


[Figure 5.1: panels for 50,000, 250,000, and 500,000 SNPs, showing time (seconds) against number of samples for glmnet, HyperLasso, LL-CD-L2, LL-L1, and SparSNP.]

Figure 5.1.: Time (in seconds) for model fitting, over sub-samples of the entire celiac disease dataset, taken as the minimum over 10 independent runs. (a) All methods including hyperlasso. (b) Excluding hyperlasso. For in-memory methods we included the time to read the binary data into R. For SparSNP and glmnet we used a λ grid of size 20, and a maximum model size of 2048 SNPs. liblinear used C = 1. hyperlasso used one iteration with λ = 1 (DE prior). The insets show the leftmost panel (50,000 SNPs) on its own scale to better visualise the differences.


5.5.1. SparSNP makes possible rapid, low-memory analysis of massive SNP datasets

SparSNP consistently outperformed the other methods when fitting models to the data (Figure 5.1). We ran all methods on random subsets of the celiac disease dataset, with p = {50,000, 250,000, 500,000} SNPs and N = {1000, 5000, 10,000} samples, a total of nine subsets. This process was independently repeated 10 times. Only SparSNP and liblinear-cdblock could fit models to datasets with > 5000 samples and 250,000 SNPs, and they were the only tools that could fit models to > 1000 samples and 500,000 SNPs on a machine with 32GiB RAM (running on a single CPU). It is important to note that the aforementioned data sizes would be considered quite small by current standards. Also note that, in contrast with SparSNP, liblinear-cdblock does not implement an l1-penalised model but a standard l2-penalised support vector machine (SVM), which is not a sparse model and does not produce solutions over a grid of model sizes; instead, a computationally expensive scheme such as recursive feature elimination (Guyon et al., 2002) would be required to find sparse models, but we did not use RFE here. Of the remaining methods, liblinear and glmnet did not complete all experiments, due to running out of memory (on a 32GiB RAM machine) or due to the data exceeding the limit on matrix sizes in R (a maximum of 2^31 − 1 elements). hyperlasso took much longer to complete: ∼2 hours for the 1000-sample/500,000-SNP subset and ∼69 hours for the 10,000-sample/500,000-SNP subset.

We emphasise that these results are for one run over the data. In practice, cross-validation is used to guide model selection and to evaluate the generalisation error of a model. Run times for cross-validation would be higher yet: 3-fold cross-validation repeated 10 times would take approximately 20 times longer, ∼22 and ∼4 hours for liblinear-cdblock and SparSNP, respectively, over the largest subset, making the differences in speed even more important. Also note the difference in the number of models fitted: both SparSNP and glmnet use a warm-restart strategy, computing a separate model for each penalty in a grid of 20 penalties, resulting in a path of 20 separate models with different sizes, whereas liblinear, liblinear-cdblock, and hyperlasso computed only one model based on one penalty; exploring a grid of penalties would be costlier still in terms of time.

SparSNP is implemented in C, glmnet is mainly implemented in Fortran, and liblinear, liblinear-cdblock, and hyperlasso are implemented in C++. Therefore, we assume that the implementation language is not a large factor in the speed differences.

[1] http://www.genomics.csse.unimelb.edu.au/SparSNP
[2] http://cran.r-project.org/web/packages/glmnet
[3] http://www.csie.ntu.edu.tw/~cjlin/liblinear
[4] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/cdblock
[5] http://www.ebi.ac.uk/projects/BARGEN/download/HyperLasso


[Figure 5.2: AUC and explained phenotypic variance ("VarExp") against the number of SNPs in the model (1 to 1024), for glmnet, HyperLasso, LL-CD-L2, LL-L1, and SparSNP.]

Figure 5.2.: LOESS-smoothed AUC and explained phenotypic variance (denoted "VarExp") for the Finnish celiac disease dataset, for increasing model sizes. For liblinear-cdblock (LL-CD-L2), all 516,504 SNPs are included in the model. AUC is estimated over 30 × 3-fold cross-validation. The explained phenotypic variance is estimated from the AUC using the method of Wray et al. (2010), assuming a population prevalence of K = 1%.

5.5.2. SparSNP produces models of better or comparable predictive ability

We used the Finnish subset of the celiac disease dataset (N = 2476 samples, p = 516,504 SNPs) to evaluate the predictive performance of the models in 3-fold cross-validation. Overall, the Finnish dataset had low missingness (no samples or SNPs with missingness ≥ 5%). To enable glmnet and liblinear to run on data of this size, we used a machine with 48GiB RAM. We measured predictive ability with the area under the receiver operating characteristic curve (AUC) (Hanley and McNeil, 1982). From the AUC we also estimated the explained proportion of phenotypic variance (Wray et al., 2010), assuming a population prevalence for celiac disease of K = 1%. We did not evaluate predictive ability over the entire celiac dataset, as it consists of several populations of different ethnic backgrounds (British, Italian, Finnish, and Dutch), and the case/control status may be confounded by effects such as population stratification.

SparSNP induced models with AUC of up to 0.9 and explained phenotypic variance of up to ∼40% (Figure 5.2), almost identical to glmnet, except for small differences at the extremes of the λ path; these differences may be due to the fact that SparSNP and glmnet use different loss functions and have different parameters such as convergence tolerances. liblinear showed a similar maximum AUC to the other methods, but much lower AUC for smaller numbers of SNPs in the model. liblinear-cdblock showed consistently lower AUC over the range of costs used: a grid of 18 costs C = 10^{-4}, ..., 10^{3}. Varying the costs did not substantially change the AUC (maximum changes of < 0.01 in AUC), therefore we show the results averaged over all costs.


Since liblinear-cdblock uses an l2-SVM, which does not induce sparse models and does not natively produce a range of model sizes, we show results for a model with all 516,504 SNPs.

Due to the high computational cost of running hyperlasso, we were not able to run as comprehensive a grid search; therefore, we performed only two replications of 3-fold cross-validation, using the DE prior with parameter λ = {10, 20, 30, 40, 50, 60} over 10 posterior modes, and averaged the AUC over the modes.

Importantly, while SparSNP achieved AUC better than or comparable to the other approaches, the resources consumed were far from equal: SparSNP performed 3-fold cross-validation using a total of about 1GiB of RAM, whereas liblinear required about 24GiB, and glmnet used up to 27GiB (the total number of samples used in the cross-validation training phase is ∼1650, or 2/3 of the total Finnish subset). Both liblinear-cdblock and hyperlasso used low amounts of memory: liblinear-cdblock used about 210MiB (using 50 disk-based blocks), and hyperlasso used a maximum of only 2GiB (roughly the size of the training data); however, hyperlasso was by far the slowest.

5.6. Software Features

An overview of the SparSNP analysis pipeline is shown in Figure 5.3. The software implementation of SparSNP includes the following features:

• implementation of l1-penalised linear regression for continuous traits and l1-penalised classification for binary traits;

• speed: SparSNP fits models to data with 10^4 samples and 5 × 10^5 SNPs in < 10 minutes on a single CPU;

• small (and tunable) memory requirements: ∼1GiB for the datasets analysed here;

• compatibility with PLINK BED (SNP-major ordering) and FAM files (single phenotype) (Purcell et al., 2007);

• cross-validation performed natively, removing the need to manually split datasets;

• production of a set of models with increasing numbers of SNPs in each model, allowing model selection based on cross-validated predictive performance;

• calculation of the area under the receiver-operating-characteristic curve (AUC) and the explained phenotypic or genetic variance, in cross-validation or on replication datasets.


[Figure 5.3: flowchart of an example SparSNP analysis pipeline with the following stages.]

Acquire data: discovery dataset and validation dataset (if available).

Imputation: IMPUTE/MACH/Beagle.

Quality control: filter SNPs by posterior probability calls, missingness, MAF, and HWE; filter samples by missingness; test for differential missingness; test for other confounders (population stratification, two-locus test).

SparSNP discovery: cross-validation on the discovery data; plot AUC and variance explained; choose a good model or models.

Validation: check reference allele agreement between the discovery and validation datasets; map discovery SNPs to validation SNPs if on different platforms; predict on the validation data; compute AUC and variance explained in the validation data.

Figure 5.3.: An example pipeline for analysing a SNP discovery dataset with SparSNP and testing the model on a validation dataset. Most of the data preparation and processing can be done with PLINK.


5.7. Discussion

We have introduced our tool SparSNP, for fitting lasso-penalised linear models to large SNP datasets. In experiments using a celiac disease dataset, we have shown that SparSNP is faster than four other state-of-the-art methods while using small amounts of memory, and achieves comparable or better predictive ability. The main bottleneck in such analyses is the large amount of RAM required to fit models, which may not be feasible or accessible to many users; SparSNP incorporates multiple computational strategies to minimise the amount of RAM required. Even when such memory is available, the time taken to read the data from disk becomes the bottleneck, rather than the fitting process itself. Thus, the time taken to analyse the data may be long enough to preclude a comprehensive analysis, such as multiple rounds of cross-validation or experimenting with various model parameters. SparSNP makes it possible to rapidly analyse such datasets: 10 replications of 3-fold cross-validation of a 10,000-sample/500,000-SNP dataset can be performed in about 2 hours, requiring only ∼1GiB RAM. This time can be further reduced by running multiple instances in parallel on a compute cluster. While the celiac disease dataset analysed here is quite large, recent genome-wide studies are larger still, involving 1-6 million SNPs, either by direct assay or by imputation from HapMap (International HapMap 3 Consortium, 2010; International HapMap Consortium, 2007) or 1000 Genomes (1000 Genomes Project Consortium, 2010). The number of samples in current datasets is larger as well, and is likely to continue growing into the hundreds of thousands. For such studies, fitting multivariable models is not feasible with standard tools. SparSNP is scalable in terms of memory requirements, and yet is faster than comparable approaches, making it suitable for analysing such datasets.

All of the l1-penalised methods considered here induced models with consistently better predictive ability than the l2-penalised SVM implemented in liblinear-cdblock. This indicates that the l1-penalised approach is preferable both in terms of prediction and in terms of interpretability, as many model weights are set to zero, in contrast to the l2 methods, where typically none of the weights are exactly zero and additional postprocessing is required to extract the subset of important variables. The success of l1 methods may also suggest that the underlying genetic architecture of celiac disease is indeed sparse: very few of the assayed SNPs are strongly associated with the phenotype. Nonetheless, we cannot assume that the strongly associated SNPs are truly causal, as they are mostly tag SNPs that are in LD with the causal SNPs. Better detection of causal SNPs may be achieved using fine-mapping data such as the Immunochip for immune-related diseases (Trynka et al., 2011).

There are several ways in which the basic SparSNP approach can be expanded.
First, the genetic models implemented in SparSNP could be further expanded to include dominant and recessive models. This could be implemented as extra variables, in addition to the additive coding variables, such that each SNP would be represented by three feature vectors that could be selected by the lasso independently, based on their contribution to the model.


Another, computationally cheaper, scheme would be to allow these extended models to apply only when a SNP already has a non-zero weight in the additive model. Second, the simple imputation implemented in SparSNP is for convenience only, when missingness is sufficiently low. Sophisticated imputation methods would necessarily come at the cost of computational efficiency; imputing by allele frequency may be a good compromise, as it is less biased than imputing by fixed probabilities but is computationally cheap. Third, other variables such as clinical variables (sex, age) and population structure variables (principal components) could conceptually be added to the model, thus allowing adjustment for potential confounders and making SparSNP more useful for practical analyses where such confounding is common.


6. Sparse Linear Models Explain Phenotypic Variation and Predict Risk of Complex Disease

6.1. Introduction

In Chapter 4 we examined the problem of predicting the phenotype of breast cancer relapse from gene expression data. In this chapter, we again consider the task of phenotype prediction, but from genetic data. In contrast with gene expression, which is dynamic, genetic variation is fixed for the life of the individual. In addition, typical genetic datasets are much larger than most gene expression datasets, consisting of thousands of individuals and hundreds of thousands to millions of SNPs (single nucleotide polymorphisms), compared with typical gene expression datasets of several hundred individuals and up to tens of thousands of probes. Therefore, we apply the sparse linear models discussed in Chapter 5, which are suited to fitting models to data of this scale. The value of predictive models is threefold. First, good predictive models of disease may enable better diagnostic tools for detecting individuals at higher risk of disease, enabling early intervention or treatment. Second, analysing the predictive ability of the SNPs allows us to better quantify the genetic component of disease, relative to other factors such as the environment. Third, characterising the most predictive SNPs provides information about potential causal mechanisms of disease and about genetic regulation in general.

To maximise predictive value and identify causal SNPs, all SNPs should be modelled simultaneously in a multivariable model.


We present a comprehensive analysis of simulated and real data using lasso-penalised multivariable models. In simulation, our multivariable models achieved lower false-positive rates than univariable methods for detecting causal SNPs. Using genome-wide SNP profiles for 32,000 individuals across eight complex diseases, we found that our models accurately discriminated cases from controls in celiac disease and type 1 diabetes. For these diseases, the models replicated strongly across independent datasets, with validation Area Under the receiver operating characteristic Curve (AUC) of 0.84 for type 1 diabetes and 0.82-0.9 for celiac disease, the latter across four independent datasets of different European ethnicities. Consequently, the models of celiac disease and type 1 diabetes explained substantial phenotypic variance in independent validation: 22% for type 1 diabetes and 21-38% for celiac disease. Investigation of type 1 diabetes and celiac disease substructure revealed highly predictive subtypes that achieve ≥99% specificity and, in some cases, positive predictive values ≥0.80. Taken together, this study shows that supervised learning approaches can address missing phenotypic variance and reliably predict the incidence of celiac disease and type 1 diabetes from genotype.

In this chapter, we aim to comprehensively assess the performance of lasso-penalised models in SNP association analysis and to investigate their implications in the population context. First, in simulation, we investigate how well the lasso recovers true causal SNPs, as compared with univariable testing. In contrast to many existing simulation setups, we argue for the use of precision rather than sensitivity (power) in measuring detection ability. Second, we apply lasso models to two celiac disease datasets and seven Wellcome Trust Case-Control Consortium (WTCCC) (The Wellcome Trust Case Control Consortium, 2007) datasets (bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, and type 1 and type 2 diabetes). We evaluate the predictive ability of these models in cross-validation, and for celiac disease and type 1 diabetes also in independent validation datasets. We further examine the positive and negative predictive values produced by these models, taking into account the population prevalence of each disease, and finally identify subgroups of celiac disease and type 1 diabetes cases that can be predicted with high confidence, potentially indicating previously unknown disease substructure.

6.2. Methods

We now describe the statistical models we use to model the association between genotypes and the phenotype.

6.2.1. Genetic Models

We considered three methods for fitting statistical models to the SNP data: l1-penalised squared hinge loss, two-stage logistic regression, and GCTA (Yang et al., 2011).


Lasso Models

We used l1-penalised squared hinge loss models, implemented in the package SparSNP (as described in Chapter 5), over a grid of 20 penalties, to induce models with increasing numbers of SNPs. All models were linear in the minor allele dosage {0, 1, 2}. All models were evaluated using cross-validation and, when a validation dataset was available, also on the validation dataset. For the validation of models, we selected the number of SNPs in the model that yielded the highest cross-validated AUC (see Section B.1) on the discovery dataset. We then applied these models without any further modification or tuning to the independent validation dataset, to derive an unbiased estimate of the models' AUC.

Logistic Regression

For the logistic regression, we first use univariable logistic regression on each SNP separately, yielding p separate models,

\operatorname{logit}(p_i^{(j)}) = \beta_0^{(j)} + x_i^{(j)} \beta_1^{(j)}, \quad j = 1, ..., p,

where logit(p) = log(p/(1 − p)), p_i^{(j)} is the probability of disease for the ith sample based on the jth SNP, x_i^{(j)} is the ith genotype for the jth SNP, and β_0^{(j)} and β_1^{(j)} are the intercept and regression coefficient for the jth SNP, respectively. The logistic model is fitted using a variant of iteratively-reweighted least squares (IRLS, equivalent to Newton's method) (Hastie et al., 2009a). The p-value for the association is derived from the z-statistic for each coefficient (Wald's test (Agresti, 2002)), where the z-statistic itself is derived by inverting the Hessian matrix evaluated at the maximum likelihood solution. In the second stage, we filter the SNPs based on their p-values, and fit a multivariable logistic model to all k remaining SNPs,

\operatorname{logit}(p_i) = \beta_0' + \sum_{j=1}^{k} x_{ij}' \beta_j',

where x_i' ∈ R^k represents the ith vector of genotypes for the k SNPs that remained after the filtering, β' ∈ R^k is the corresponding k-vector of model weights, and β_0' is the intercept. Prior to model fitting, we further filter the SNPs based on their pairwise correlation, r² < 0.8 (a generally accepted threshold for high LD; see for example Carlson et al. (2004); Hinds et al. (2005)), in order to reduce the effects of multicollinearity, as IRLS may fail for highly correlated inputs (due to a singular Hessian matrix).
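A hedged sketch of this two-stage fitting procedure using statsmodels (illustrative only; the p-value threshold, the LD-pruning scheme, and the variable names here are assumptions and not necessarily those used in the actual analysis):

import numpy as np
import statsmodels.api as sm

def two_stage_logistic(X, y, p_threshold=1e-6, r2_threshold=0.8):
    # X: N x p minor-allele dosage matrix, y: binary phenotype (0/1).
    n, p = X.shape
    # Stage 1: univariable logistic regression per SNP, keeping Wald p-values.
    pvals = np.ones(p)
    for j in range(p):
        fit = sm.Logit(y, sm.add_constant(X[:, j])).fit(disp=0)
        pvals[j] = fit.pvalues[1]
    keep = np.where(pvals < p_threshold)[0]
    # Prune SNPs in high LD (pairwise r^2 >= 0.8) among those retained.
    pruned = []
    for j in keep:
        if all(np.corrcoef(X[:, j], X[:, k])[0, 1] ** 2 < r2_threshold
               for k in pruned):
            pruned.append(j)
    # Stage 2: multivariable logistic model on the retained SNPs.
    final_fit = sm.Logit(y, sm.add_constant(X[:, pruned])).fit(disp=0)
    return pruned, final_fit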


Finally, we use the multivariable model to predict the probability of case status from the genotype and the estimated model weights,

\Pr(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\left(-\beta_0' - \sum_{j=1}^{k} x_{ij}' \beta_j'\right)}.

GCTA

Briefly, Genome-wide Complex Trait Analysis (GCTA) (Yang et al., 2011) implements a mixed-effect model (Gelman and Hill, 2007), where the SNPs are considered random effects and other variables such as sex and age are fixed effects. Fixed effects are effects that are constant among samples with the same value of the variable. For example, using sex as a fixed effect in a regression of height on sex (and excluding any other effects), all males have the same predicted height. In contrast, random effects are effects that vary randomly between samples with the same realisation of the variable. For example, when regressing height on the city in which a person lives, and using the city as a random effect, we allow heights to vary within each city, but they may still be more correlated within each city than between cities, depending on the model. Modelling the city as a random effect allows us to account for such correlations in a generic way, without having to fit a specific regression term for each city that happened to occur in the sample. Typically, discrete variables modelled as fixed effects are those for which all possible values can be measured in the study (for example, male/female for sex), whereas variables are more suitable to be modelled as random effects if they represent a sample of all possible values the variable can take (for example, a sample of cities out of all possible cities in Australia).

For a continuous outcome y, GCTA implements the mixed-effect model

y_i = \beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j + g_i + \epsilon_i, \quad i = 1, ..., N,    (6.1)

where y ∈ R^N are the phenotypes, β ∈ R^p are weights for variables such as sex, age, and other clinical measurements of interest, and g ∈ R^N is a vector of normally-distributed genetic effects with g ∼ N(0, A σ²_g) and var(y) = A σ²_g + I σ²_ε, where A is the N × N genetic relationship matrix (GRM) between individuals, I is the N × N identity matrix, and σ²_g and σ²_ε are the variance explained by the SNPs and the variance of the residuals, respectively. The GRM is estimated from the SNP data. For binary outcomes (such as case/control data), a model similar to (6.1) is used, except that it is linear on the log-odds (logit) scale. From the GCTA mixed model we then estimate the risk of each sample being a case based on its genotype (the best linear unbiased predictor, BLUP), and this score is then used to rank the samples in the subsequent analysis.
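As an illustration of how a genetic relationship matrix might be computed from genotypes, here is a hedged NumPy sketch following the commonly used form A = ZZ^T/p, where Z holds column-standardised allele dosages; this is a simplified stand-in, not GCTA's exact implementation (which, for example, handles missing genotypes and per-SNP weighting explicitly), and it assumes no monomorphic SNPs.

import numpy as np

def genetic_relationship_matrix(X):
    # X: N x p matrix of allele dosages {0, 1, 2}, no missing values.
    freq = X.mean(axis=0) / 2.0                        # estimated allele frequencies
    Z = (X - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    return Z @ Z.T / X.shape[1]                        # N x N GRM, A = ZZ'/p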


6.2.2. HAPGEN2 simulations

We used HAPGEN (Su et al., 2011) v2.2.0 to generate simulated case/control data with linkage disequilibrium (LD) patterns based on haplotype and legend data from HapMap3 (International HapMap Consortium, 2005, 2007) release 22 CEU data of chromosome 10 (73,832 SNPs). In order to reduce memory requirements, we split the chromosome into 148 blocks of up to 500 SNPs each, randomly selected one SNP from each block as causal, and combined the blocks together to form a complete chromosome. We used a multiplicative model of risk, where each causal SNP was randomly assigned one risk ratio per dose out of {1.1, 1.2, 1.3, 1.4, 1.5, 2.0}. For example, with a risk ratio per dose of 1.1, the risk ratios for the homozygous genotype (two protective alleles), the heterozygous genotype, and the homozygous genotype (two risk alleles) are 1.0, 1.1, and 1.21, respectively. We conducted eight sets of simulations, each using a different number of samples N = {100, 500, 2500, 5000, 10,000, 20,000, 50,000, 100,000}, where the number of cases and controls is balanced (N/2 for each class). In considering the prediction of each SNP as causal/non-causal, only the 148 causal SNPs were taken to be true positives, with the rest considered false positives.

We fit l_1-penalised squared-hinge loss models to each dataset, over a grid of l_1 penalties. For each penalty, the absolute value of the estimated SNP weights \beta_j was thresholded at different cutoffs to decide which SNP was causal (above the cutoff) or non-causal (below the cutoff). As a baseline for comparison, we used univariable genotypic (allele dosage model) logistic regression (one SNP at a time); we used the negative log_10 of the Wald-test p-value to rank the SNPs from most likely to be associated to least likely. We also used the estimated regression coefficient from the logistic regression and the log_10 of the p-value from the 1-df allelic test, with very similar results to the logistic test (results not shown).

Note that the univariable logistic model was used only to detect SNPs with significant associations with the case/control phenotype, and was therefore employed in the simulations for assessing causal SNP recovery. The multivariable logistic model was used to create case/control predictive models for the WTCCC and celiac disease data, based on the SNPs identified by the univariable method, and was not assessed in the simulations.

6.2.3. Positive and negative predictive values

The positive and negative predictive values (PPV and NPV) (Altman and Bland, 1994) of a model are estimated as

    \mathrm{PPV} = \frac{\mathrm{sens} \times \mathrm{prev}}{\mathrm{sens} \times \mathrm{prev} + (1 - \mathrm{spec}) \times (1 - \mathrm{prev})}
    \quad \text{and} \quad
    \mathrm{NPV} = \frac{\mathrm{spec} \times (1 - \mathrm{prev})}{\mathrm{spec} \times (1 - \mathrm{prev}) + (1 - \mathrm{sens}) \times \mathrm{prev}},

where “sens” is the sensitivity = TP/(TP + FN), “spec” is the specificity = TN/(FP + TN), TP are the true positives, FN are the false negatives, FP are the false positives, TN are the true negatives, and “prev” is the prevalence in the population, in the range [0, 1].
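A direct transcription of these two formulas into R might look as follows (an illustrative helper, not the thesis code; the inputs are assumed to be cross-validated sensitivity and specificity and an assumed population prevalence):

```r
## PPV and NPV at an assumed population prevalence, from the formulas above.
ppv.npv <- function(sens, spec, prev) {
  ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
  npv <- spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
  c(PPV = ppv, NPV = npv)
}

## e.g. a classifier with 80% sensitivity and 99% specificity at 1% prevalence:
## ppv.npv(0.80, 0.99, 0.01)   # PPV is only ~0.45 despite the high specificity
```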


We estimated the PPV and NPV in cross-validation, where the confusion matrix (TP, FP, TN, FN) is derived from the predicted case/control class and the actual class in the test data. The prediction of case/control status is a binarisation of the classifier’s continuous output, the linear predictor l_i = \hat\beta_0 + \sum_{j=1}^{p} x_{ij} \hat\beta_j, where x_{ij} is the ith observation for the jth SNP in the test data, \hat\beta_j is the estimated coefficient for the jth SNP, and \hat\beta_0 is the intercept. Samples with a linear predictor score above a given cutoff are classified as cases, whereas those below are classified as controls. By varying the cutoff, different pairs of ⟨sensitivity, specificity⟩ are achieved, thus inducing a curve of ⟨PPV, NPV⟩ pairs. For this reason, the range of PPV and NPV is limited by the data — the discreteness of finite-size samples prevents very low or very high PPV and NPV values. We averaged the PPV in small bins of NPV, over the cross-validation replications, in order to reduce the variability in the estimated PPV.

6.2.4. Genomic Inflation Factor

The genomic inflation factor (Devlin and Roeder, 1999) for a given SNP j is defined as

    \lambda^{(j)} = Y_A^{2(j)} / Y^{2(j)},    (6.2)

where

    Y^{2(j)} = \frac{N\left[N(r_1 + 2r_2) - R(n_1 + 2n_2)\right]^2}{R(N - R)\left[N(n_1 + 4n_2) - (n_1 + 2n_2)^2\right]}    (6.3)

is the test statistic for the Cochran-Armitage trend test and

    Y_A^{2(j)} = \frac{2N\left[2N(r_1 + 2r_2) - 2R(n_1 + 2n_2)\right]^2}{(2R)\left(2(N - R)\right)\left[2N(n_1 + 2n_2) - (n_1 + 2n_2)^2\right]}    (6.4)

is the \chi^2 allelic test for association with the phenotype. The overall inflation factor \lambda is typically estimated from the p per-SNP factors as

    \lambda = \mathrm{median}(\lambda^{(1)}, \ldots, \lambda^{(p)}).    (6.5)

A value of \lambda > 1 indicates the presence of non-random mating in the population represented in the data, which can be caused by population stratification (samples of multiple ethnic backgrounds) or cryptic relatedness (unaccounted-for familial relationships), both of which can confound typical case/control analyses.
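Equations (6.2)–(6.5) can be transcribed directly into R. The sketch below is illustrative only and assumes the usual case/control notation, with r_1 and r_2 the numbers of cases carrying one and two copies of the allele, n_1 and n_2 the corresponding counts over all samples, R the total number of cases, and N the total sample size; `counts` is a hypothetical data frame with one row per SNP.

```r
## Cochran-Armitage trend statistic, equation (6.3)
trend.stat <- function(r1, r2, n1, n2, R, N) {
  N * (N * (r1 + 2 * r2) - R * (n1 + 2 * n2))^2 /
    (R * (N - R) * (N * (n1 + 4 * n2) - (n1 + 2 * n2)^2))
}

## Allelic chi-squared statistic, equation (6.4)
allelic.stat <- function(r1, r2, n1, n2, R, N) {
  2 * N * (2 * N * (r1 + 2 * r2) - 2 * R * (n1 + 2 * n2))^2 /
    ((2 * R) * (2 * (N - R)) * (2 * N * (n1 + 2 * n2) - (n1 + 2 * n2)^2))
}

## Per-SNP inflation factors (6.2) and their median (6.5)
lambda.gc <- function(counts, R, N) {
  lam <- with(counts, allelic.stat(r1, r2, n1, n2, R, N) /
                      trend.stat(r1, r2, n1, n2, R, N))
  median(lam)
}
```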


6.2.5. Data and quality control

The Bipolar Disorder (BD), Coronary Artery Disease (CAD), Crohn’s Disease / Irritable Bowel Syndrome (Crohn’s), Hypertension (HT), Rheumatoid Arthritis (RA), Type 1 Diabetes (WTCCC-T1D), and Type 2 Diabetes (T2D) datasets were obtained from the WTCCC (The Wellcome Trust Case Control Consortium, 2007), in addition to the 1958 birth cohort dataset (58C) and the National Blood Service dataset (NBS), which served as shared controls. We used the default Chiamo genotype calls generated by the WTCCC. We removed samples that were excluded by the WTCCC due to being highly related or duplicated, SNPs that were in the WTCCC exclusion list, and SNPs that were visually identified by the WTCCC as having bad cluster plots. There were 459,012 remaining autosomal SNPs in each dataset (Table 6.1). We obtained the GoKinD-T1D case-only dataset from NIH dbGaP (accession no. phs000018.v1.p1). For the GoKinD-T1D dataset, we removed A/T and G/C SNPs to minimise strand mismatches between that dataset and the WTCCC datasets, and removed SNPs with MAF < 0.05, missing observations > 0.01, and Hardy-Weinberg equilibrium p-value < 10^{-6}. Samples were removed if they had missing phenotypes, or were missing > 0.01 of the genotypes, leaving a total of 1604 cases over 265,023 autosomal SNPs. The GoKinD-T1D dataset was matched with the NBS control dataset to form a complete case/control dataset.

We used two versions of each of the two celiac disease datasets, Celiac1 (van Heel et al., 2007) and the UK subset of the Celiac2 (Dubois et al., 2010) dataset (Celiac2-UK). The first version was the original data as published, with 301,689 and 516,504 autosomal SNPs, respectively. The second was a stringently-filtered dataset, in which SNPs were removed if they had MAF ≤ 0.01, missingness ≥ 0.05, deviation from Hardy-Weinberg equilibrium in controls p ≤ 0.05, differential missingness between cases and controls p ≤ 0.05, and two-locus test (Lee et al., 2010) p ≤ 0.05. Samples were removed if they had missingness ≥ 0.01, as were both samples in each pair with identity-by-descent (IBD) \hat\pi ≥ 0.05 (a low level of relatedness, about half as related as first cousins, see for example (Browning and Browning, 2011; Lee et al., 2011)). We removed both samples rather than one for the sake of consistency with (Lee et al., 2010). For the stringently-filtered datasets, there were 2109 samples and 279,312 autosomal SNPs for the Celiac1 dataset and 6613 samples and 471,191 autosomal SNPs for the Celiac2-UK dataset. Similarly, we also used a stringently filtered version of WTCCC-T1D that underwent the same filtering, leaving 4901 samples and 370,280 SNPs.

6.3. Results

We compared the lasso squared-hinge loss model with logistic regression in simulation, and with logistic regression and GCTA in analysis of the nine complex disease datasets. The main results are shown in this chapter; other supporting results are included in Appendix B.


6.3.1. Recovery of Causal SNPs in Simulation

To assess squared-hinge loss lasso model performance, we used HAPGEN2 (Marchini et al., 2007; Su et al., 2011) to simulate various case/control genotype datasets where the causal SNPs were known (see Section 6.2.2). Unlike some other published simulations (Ayers and Cordell, 2010), we define a true positive only as detecting one of the causal SNPs specified by HAPGEN2; any other SNP is taken to be a false positive, regardless of LD or distance. With real SNP arrays the concept of a “causal SNP” can be unclear, as the assayed SNPs are mostly tag SNPs and the causal SNP itself may not have been assayed. In contrast, here the causal SNP is always present, and our aim is to assess how well different statistical methods differentiate signal (true causal SNPs) from noise (non-causal SNPs in LD with the causal SNP). We summarised the results of the HAPGEN2 simulations using two measures: the Area Under the Receiver Operating Characteristic Curve (AUC) (Hanley and McNeil, 1982) and the Area under the Precision-Recall Curve (APRC). While the AUC is commonly used for evaluating binary classification performance, it can be misleading when comparing classifiers where one class vastly outnumbers the other, since a classifier with even a tiny false positive rate will incur a large absolute number of false positives, but this failing will not necessarily be reflected in the AUC (Supplementary Section B.1). In contrast, APRC is more sensitive to false positives, and is more suitable when generating biological hypotheses from imbalanced data.

The lasso method was able to detect causal SNPs with fewer false positives than univariable logistic regression (Figure 6.1). As expected, APRC increased for both methods with sample size and with increasing risk ratios. Overall, APRC for the lasso increases much faster, finally achieving APRC = 0.8 for a risk ratio of 2.0, compared with APRC of ∼ 0.3 for the univariable method. In contrast, AUC results (Supplementary Figure B.2) were consistently higher for univariable logistic regression for small and medium sample sizes, and roughly equivalent for a sample size of 100,000. These results indicate that whereas the univariable method is better at finding all causal SNPs, it does so by introducing a large number of false positives, due to high LD, manifesting in substantially lower APRC. In contrast, especially for smaller model sizes, the lasso may have less sensitivity to identify causal SNPs than univariable logistic regression, but the SNPs found are far less likely to be false positives (higher specificity).
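For illustration, the APRC summary can be computed from any per-SNP ranking statistic, whether the lasso |β_j| or the univariable −log_10 p-value. The sketch below (not the thesis code) uses hypothetical inputs `score` (the ranking statistic) and `causal` (a logical vector marking the 148 simulated causal SNPs), and approximates the area under the precision-recall curve with the trapezoidal rule.

```r
## Precision-recall summary of causal-SNP recovery for a given ranking.
pr.area <- function(score, causal) {
  ord <- order(score, decreasing = TRUE)   # rank SNPs from strongest to weakest
  tp <- cumsum(causal[ord])                # true positives among the top-k SNPs
  precision <- tp / seq_along(tp)
  recall <- tp / sum(causal)
  ## area under the precision-recall curve (trapezoidal rule)
  sum(diff(recall) * (head(precision, -1) + tail(precision, -1)) / 2)
}
```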
6.3.2. Modelling genome-wide profiles for eight complex diseases

We applied lasso models to nine discovery datasets (Table 6.1). Seven discovery datasets were from the WTCCC (The Wellcome Trust Case Control Consortium, 2007) — Bipolar Disorder (BD), Coronary Artery Disease (CAD), Crohn’s Disease/Irritable Bowel Syndrome (Crohn), Hypertension (HT), Rheumatoid Arthritis (RA), Type 1 Diabetes (WTCCC-T1D), and Type 2 Diabetes (T2D). Two additional discovery sets were celiac disease datasets (Dubois et al., 2010; van Heel et al., 2007), denoted here Celiac1 and Celiac2-UK (samples of UK descent only).


Figure 6.1.: APRC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true “causal” SNPs in the data.


Disease                                      Abbrev.      Cases  Controls  Autosomal SNPs  Platform         Reference
Bipolar Disease                              BD            1868      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Coronary Artery Disease                      CAD           1926      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Hypertension                                 HT            1952      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Rheumatoid Arthritis                         RA            1860      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Crohn’s Disease / Irritable Bowel Syndrome   Crohn         1748      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Type 1 Diabetes                              WTCCC-T1D     1963      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Type 2 Diabetes                              T2D           1924      2938         459,012  Affymetrix 500K  (The Wellcome Trust Case Control Consortium, 2007)
Celiac Disease                               Celiac1        778      1422         301,689  Illumina †       (van Heel et al., 2007)
Celiac Disease                               Celiac2-UK    1849      4936         516,504  Illumina †       (Dubois et al., 2010)

Table 6.1.: List of discovery datasets used in this analysis. The 1958 British Birth Cohort (N = 1480) and the National Blood Service (N = 1458) datasets were used as shared controls for all WTCCC datasets. † Celiac1 used Illumina HumanHap33v1-1 for cases and HumanHap550-2v3 for controls, and Celiac2-UK used Illumina 670-QuadCustom-v1 for cases and Illumina 1.2M-DuoCustom-v1 for controls.

We also used three validation sets for celiac disease — the Finnish (Celiac2-Finn), Italian (Celiac2-IT), and Dutch (Celiac2-NL) cohorts from (Dubois et al., 2010) — and one validation set for T1D — the GAIN GoKinD (Mueller et al., 2006; Pezzolesi et al., 2009) T1D dataset (GoKinD-T1D) (Table 6.2). For controls, all the WTCCC datasets used both the 1958 birth cohort (58C) and National Blood Service (NBS) datasets (The Wellcome Trust Case Control Consortium, 2007) as shared controls. We paired the GoKinD-T1D dataset with the NBS control dataset. To prevent including the same controls in the T1D discovery and validation datasets, we also used a version of the WTCCC-T1D dataset that only included the WTCCC-T1D cases and 58C controls, and this reduced version was tested on the GoKinD-T1D data.


Disease           Abbrev.        Cases  Controls  Autosomal SNPs  Platform         Reference
Celiac Disease    Celiac2-IT       497       543         516,504  Illumina †       (Dubois et al., 2010)
Celiac Disease    Celiac2-NL       803       846         516,504  Illumina †       (Dubois et al., 2010)
Celiac Disease    Celiac2-Finn     647      1829         516,504  Illumina †       (Dubois et al., 2010)
Type 1 Diabetes   GoKinD-T1D      1604      1458         265,023  Affymetrix 500K  (Mueller et al., 2006; Pezzolesi et al., 2009)

Table 6.2.: List of independent replication datasets used. The National Blood Service (N = 1458) dataset was used as controls for the GoKinD-T1D dataset. † Celiac2-IT and Celiac2-NL used Illumina 670-QuadCustom-v1 for cases and controls; Celiac2-Finn used 670-QuadCustom-v1 for cases and 610-Quad for controls.

6.3.3. Assessment of confounding factors

While population stratification, batch effects, or data missingness can create spurious associations between the genotypes and case/control status, there was no evidence of confounding by these factors in the data. We estimated the genomic inflation factors of each discovery dataset, which are based on the median deviation of the per-SNP test statistics from that expected under the assumption that most SNPs are not truly associated with the phenotype. An inflation factor substantially larger than 1 corresponds to a larger than expected number of associated SNPs, potentially due to population stratification (Devlin and Roeder, 1999). The genomic inflation factors are shown in Table B.2, indicating low to non-existent levels of deviation. We also used principal component analysis (Price et al., 2006) (PCA, Supplementary Section B.2.1) to evaluate whether there was strong structure in the data, unaccounted for by LD (Figure 6.4). The genomic inflation factors were 1.0 for the WTCCC data and between 1.051–1.056 for the celiac discovery datasets, both using the logistic regression test, without accounting for LD. Applying PCA to the original celiac disease datasets, we found substantial structure, and several principal components were highly predictive of the case/control status (AUC = 0.8, Figure B.5). Regions of strong LD are known to cause artifacts in the analysis (Patterson et al., 2006). Therefore, we removed known high-LD regions (Fellay et al., 2009) (chr5: 44Mb–51.5Mb, chr6: 25Mb–33.5Mb, chr8: 8Mb–12Mb, and chr11: 45Mb–57Mb). In addition, we thinned the remaining SNPs for LD using PLINK, and in the PCA regressed each SNP on the previous five SNPs. After accounting for these strong LD regions, we found no evidence of strong structure in the Celiac1 dataset, and the principal components were not predictive of case/control status (AUC < 0.54, Figure B.5). We also plotted the PC loadings of the SNPs in the Celiac1 dataset, aggregated separately in each chromosome (Figure B.4).


As expected, chromosomes 5, 6, 8, and 11 have unusually high loadings in the original data. In the LD-pruned data, the loadings are uniform across the chromosomes, indicating that the high-LD regions above are indeed the main contributors to the variation found by PCA, and that removing LD removes the structure found in PCA. These results, together with the replication in the independent datasets (see below), strongly indicate that population structure, batch effects, and data missingness were not significant confounders for our predictive models.

Confounding by non-population effects such as batch effects and differential missingness is not a significant issue in the discovery datasets of celiac disease and T1D. Such effects are known to introduce spurious associations between the genotypes and the case/control status, artificially inflating the apparent explained variance and predictive power (Yang et al., 2011). To assess the impact of these effects, we generated a version of the Celiac1 and Celiac2-UK datasets that underwent more stringent filtering than the original data, with the aim of reducing the effect of spurious associations between SNPs and the case/control status (Methods). Overall, we saw only minor reductions in cross-validated AUC for these filtered datasets. For the stringently filtered celiac disease datasets, the cross-validated AUC for the lasso models peaked at ∼0.87 for both datasets (Figure B.6). The results for stringently-filtered WTCCC-T1D were largely unchanged as well (AUC ∼0.87), indicating that the predictive ability in celiac disease and T1D is likely due to true genetic variation rather than spurious effects.

6.3.4. Discrimination of the phenotype in cross-validation

We trained lasso squared-hinge loss models on each dataset, and evaluated their discrimination and stability for all directly-genotyped, autosomal SNPs. We evaluated discrimination of the phenotype using AUC on 20 × 3-fold cross-validation (20 × 3CV) for each dataset. Figure 6.2a shows the AUC achieved by varying the number of SNPs in the lasso model for all datasets, using all autosomal SNPs. The number of SNPs at which AUC either peaks or plateaus gives a rough estimate of the number of causal SNPs in each dataset. We can group the datasets based on their AUC. The first group includes WTCCC-T1D and Celiac1/Celiac2-UK, both achieving a maximum AUC of ∼ 0.88. The second group includes RA and Crohn, achieving AUC of up to 0.70–0.74. The third group includes the rest of the datasets (BD, CAD, HT, and T2D), achieving AUC no higher than 0.65.
The lasso models achieved equivalent and sometimes substantially better AUC in predicting case/control status, compared with two other approaches: multivariable logistic regression on SNPs chosen by univariable statistics, and risk scores produced by multivariable mixed-effects linear models in GCTA (Yang et al., 2011) (Appendix B.2.4). Based on the models’ AUC and estimates of heritability and prevalence, we estimated the proportion of phenotypic variance explained by the models on the liability scale (Wray et al., 2010) (Figure 6.2b).
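The style of cross-validated evaluation described in Section 6.3.4 can be sketched in R with glmnet, used here as a stand-in for the squared-hinge loss models fitted by SparSNP (glmnet fits an l_1-penalised logistic regression instead). The inputs `geno` and `y` are hypothetical, and a single 3-fold split is shown; repeating over 20 random splits and averaging would mirror the 20 × 3CV scheme.

```r
## Cross-validated AUC over a grid of penalties, i.e. over increasing
## numbers of SNPs with non-zero weights.
library(glmnet)
cv <- cv.glmnet(geno, y, family = "binomial", type.measure = "auc",
                nfolds = 3, nlambda = 20)
## cv$cvm  : cross-validated AUC at each penalty
## cv$nzero: number of SNPs with non-zero weights at each penalty
```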


Figure 6.2.: (a) Area under the receiver operating characteristic curve (AUC) for models of the 9 case/control datasets. Results are LOESS-smoothed over 20 × 3-fold cross-validation. See the Supplementary Results for details on each disease. (b) LOESS-smoothed proportion of phenotypic variance explained for the lasso models for the 9 discovery datasets, using the method of Wray et al. (2010).


           Celiac2-Finn       Celiac2-IT        Celiac2-NL        Celiac2-UK
           AUC     VarExp     AUC     VarExp    AUC     VarExp    AUC     VarExp
Mean       0.870   0.300      0.824   0.214     0.850   0.258     0.846   0.251
95% LCL    0.869   0.298      0.822   0.210     0.849   0.257     0.844   0.249
95% UCL    0.871   0.302      0.826   0.217     0.850   0.260     0.847   0.254

Table 6.3.: AUC and explained phenotypic variance for independent validation datasets of celiac disease models trained on Celiac1. We used models with ∼200 SNPs in the model, trained in cross-validation on Celiac1 and tested on subsets of the Celiac2 dataset. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 1%.

We did not consider genetic variance, as this depends on estimates of heritability that may vary substantially, although, given a robust estimate of heritability, explained genetic variance can be easily obtained from the explained phenotypic variance. For lasso models in cross-validation, the top two datasets in terms of explained phenotypic variance were Celiac1/Celiac2-UK (∼ 32%) and WTCCC-T1D (∼ 28%). Models of RA explained up to ∼ 10% of the variance, while the rest of the datasets achieved 5% or less.

Independent replication of celiac disease and T1D models

Models of celiac disease developed in the two discovery datasets (Celiac1 and Celiac2-UK) strongly replicated in three independent validation datasets without any further tuning. Based on cross-validation results in Celiac1, we selected models with 200 non-zero SNPs, and used them to predict case/control status in the Celiac2-UK, Celiac2-IT, Celiac2-Finn, and Celiac2-NL datasets. To avoid any optimisation bias, we did not tune the model further on these datasets. Despite the different ancestries, and the different microarray platforms, the lasso models trained on the Celiac1 dataset and tested on the four other datasets showed AUC ranging from 0.824 for Celiac2-IT to 0.87 for Celiac2-Finn, with corresponding explained phenotypic variance ranging from 21.4% for Celiac2-IT to 30% for Celiac2-Finn (Table 6.3), showing that the predictive power is strongly retained in independent replication. Similarly, we trained models in cross-validation on the Celiac2-UK subset, again choosing all models with ∼ 200 non-zero SNPs as indicated by cross-validation within that dataset, and tested them on the three other Celiac2 subsets (Table 6.4), resulting in AUC of between 0.857 for Celiac2-IT and 0.901 for Celiac2-Finn, with corresponding explained phenotypic variance of between 27.3% for Celiac2-IT and 37.5% for Celiac2-Finn (assuming population prevalence K = 1%), again showing strong agreement between datasets in terms of celiac disease risk prediction regardless of ethnic background.


Figure 6.3.: Lasso models can achieve high positive predictive values. PPV versus NPV for the lasso models of the 9 discovery datasets. Results are averaged over 20 × 3CV. See the Supplementary Results for the number of SNPs with non-zero coefficients in each dataset. Note that the curves do not span the entire range of NPV since not all sensitivity and specificity values can be observed in a finite dataset.


           Celiac2-Finn       Celiac2-IT        Celiac2-NL
           AUC     VarExp     AUC     VarExp    AUC     VarExp
Mean       0.901   0.375      0.857   0.273     0.873   0.307
95% LCL    0.900   0.373      0.856   0.270     0.873   0.306
95% UCL    0.901   0.377      0.858   0.275     0.874   0.308

Table 6.4.: AUC and explained phenotypic variance for independent validation datasets of celiac disease models trained on Celiac2-UK. Models were trained in cross-validation on the UK subset of the Celiac2 datasets, and tested on the other three subsets of the Celiac2 dataset. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 1%.

           Cross-validation: WTCCC-T1D    Independent validation: GoKinD-T1D
           AUC       VarExp               AUC       VarExp
Mean       0.842     0.219                0.842     0.217
95% LCL    0.840     0.214                0.832     0.201
95% UCL    0.850     0.223                0.852     0.233

Table 6.5.: Models were trained in cross-validation on the WTCCC-T1D dataset and tested on the GoKinD-T1D dataset, using ∼ 100 SNPs in the model. The 95% confidence interval is derived from the LOESS fit. LCL: lower confidence limit. UCL: upper confidence limit. The proportion of explained phenotypic variance assumes population prevalence K = 0.54%.


T1D models trained on the WTCCC-T1D dataset strongly replicated in the GoKinD-T1D dataset (Table 6.5). We trained models in 20 × 3-fold cross-validation on the WTCCC-T1D dataset (excluding the NBS control dataset, which was paired with GoKinD-T1D), and selected models with about 100 SNPs, based on the fact that AUC was maximised at that number of SNPs in the model (AUC = 0.843). We then applied all models with ∼ 100 SNPs, without any further tuning, to the GoKinD-T1D dataset (AUC = 0.842), with corresponding explained phenotypic variance of 22% in both datasets (assuming population prevalence K = 0.54%).

6.3.5. Genetic models in a population context

Statistical models of complex disease risk are usually trained on case/control datasets where the proportion of cases is much higher than the background prevalence in the general population. For example, the population prevalence of celiac disease is 1% (van Heel and West, 2006), whereas in the datasets used here the proportion of cases is ∼ 30%. Therefore, while a model may be able to accurately classify patients in our data, it may not have high enough specificity to be useful in a population context. To evaluate the precision of our models given the estimated population prevalence, we estimated the positive and negative predictive values (PPV and NPV) (Altman and Bland, 1994). PPV can be interpreted as the probability of having disease given a positive diagnosis, and NPV is the probability of not having disease given a negative diagnosis. A perfect model should achieve PPV = 1 and NPV = 1. PPV and NPV were estimated over a range of cutoffs of the classifier’s predictions, inducing a curve of all PPV/NPV value pairs, within cross-validation (Figure 6.3). For consistency, we used the same population prevalence estimates as in Wray et al. (2010) (Supplementary Table B.3).

For all diseases here, except the relatively common hypertension (assumed K = 13.1%), NPV was higher than 0.94 and can be assumed to always be high regardless of the predictive model used, since even a model producing random predictions will achieve high NPV when the prevalence is low — blindly predicting “no disease” regardless of genotype will be correct most of the time. In contrast, PPV here depends largely on the model’s predictive ability and less on the prevalence. Across the genetic models generated, PPV was non-uniform across samples (Figure 6.3) — for the majority of the samples it was low (PPV < 0.2), however for a small subset of samples the genetic models achieved both high positive and negative prediction with NPV > 0.94 and PPV > 0.85. Stringent quality control measures as well as external validation using the Celiac1/Celiac2-UK datasets did not change the results (Supplementary Figures B.9, B.10, B.11, B.12, B.13). The fact that high PPV was limited to a small number of samples shows that while the models of celiac disease and type 1 diabetes can discriminate cases from controls in the data examined here, these models are not suitable for population-wide screening as they would generate far too many false positives.
The models may be better targeted at screening sub-populations known to be at higher risk.
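For illustration, the PPV/NPV curve of Figure 6.3 can be traced out by sweeping cutoffs over the linear predictor and converting the resulting sensitivity/specificity pairs at an assumed prevalence. This is a rough sketch only; `lp` (linear predictor scores on test data), `y` (true 0/1 labels), and `prev` are hypothetical inputs, and `ppv.npv()` is the helper sketched earlier in Section 6.2.3.

```r
## Trace a PPV/NPV curve by varying the classification cutoff.
ppv.npv.curve <- function(lp, y, prev, n.cut = 100) {
  cuts <- quantile(lp, probs = seq(0.01, 0.99, length.out = n.cut))
  t(sapply(cuts, function(cut) {
    pred <- as.numeric(lp > cut)                    # 1 = predicted case
    sens <- sum(pred == 1 & y == 1) / sum(y == 1)
    spec <- sum(pred == 0 & y == 0) / sum(y == 0)
    ppv.npv(sens, spec, prev)                       # convert at the assumed prevalence
  }))
}
```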


[Figure 6.4: principal component plots of (a) the Celiac1 and (b) the stringently-filtered Celiac1 datasets, with samples marked by the specificity (> 0.99 or lower) at which they were classified as cases.]


Cases identified at 0.99 specificity or more are those that have been classified with high confidence as cases. For Celiac1 and Celiac2-UK, the high-specificity cases are highly concentrated in one of the three clusters uncovered by PCA. For T1D, the high-specificity samples can be stratified by PC4, however they are spread across the clusters in PC5. Note that in contrast to using PCA for assessing unknown population structure in the data, here we did not filter out high-LD regions or thin by LD, since some of these regions, such as the MHC on chr6, are highly associated with the phenotype, and excluding them would remove much of the variation we are interested in finding. However, to address concerns of spurious case structure due to effects such as differential missingness, we repeated the same procedure for the stringently-filtered version of Celiac1, with similar results (Supplementary Figure 6.4).

6.4. Discussion

We have described an analysis of the performance of lasso-penalised multivariable models that simultaneously model all SNPs in large-scale datasets, and have shown that such models can plausibly be used for disease prediction from SNP data. The models developed were robust to different disease architectures and strongly replicated across independent disease datasets, in both celiac disease and type 1 diabetes, without any further tuning required. In addition, the top SNPs identified in the Celiac1/Celiac2-UK and WTCCC-T1D datasets showed high stability across the cross-validation replications (Supplementary Section B.3).

Across all seven WTCCC and two celiac disease datasets, the lasso method achieved discrimination equal to or higher than both a combined screening/multivariable method and a multivariable mixed-effects linear model. Other possible approaches include logistic regression with forward or backward elimination of SNPs (Ayers and Cordell, 2010); however, these methods can be highly unstable, leading to different solutions based on the order in which the SNPs were added or removed, and getting stuck in local optima (Hastie et al., 2009a).

For celiac disease, T1D, and to a lesser extent RA, the major histocompatibility complex (MHC) region of chr6 contains most of the predictive ability from the SNPs we have considered (Supplementary Figures B.20, B.21, B.25, and B.24). This has also been observed by others (Wei et al., 2009). Therefore, for prediction of some autoimmune diseases, a reasonable and cost-effective approach is to focus on the MHC. The discriminatory power of our method on T1D and Crohn’s disease was similar to or better than that reported by others (Kooperberg et al., 2010; Roshan et al., 2011; Wei et al., 2009). Some of these studies used lasso models as well; the small differences may be partially explained by the fact that they pre-screened the SNPs whereas we did not. We have shown that univariable pre-screening of the SNPs can result in reduced AUC compared with simultaneously modelling all SNPs, with the difference especially apparent in the Crohn’s disease dataset (Supplementary Figure B.22).


When investigating case substructure and accounting for population prevalence, we have shown that our genetic models can identify and target particular subsets of cases with high predictive ability. Within the data, a genetically distinct subset of ∼11% of celiac disease cases and ∼13% of T1D cases could be predicted with > 99% specificity; however, the low disease prevalence in the population requires even higher specificity if population-wide screening is to accurately capture the whole subset. While the absolute numbers of high-PPV subsets were not large in the collections assessed here (N ∼ 5–11), extrapolating the size of genetically predictable disease in a population shows these genetic models may have non-trivial potential for predicting disease while maintaining a low number of false positives. For example, given a disease with 1% prevalence in a population of 100 million and a genetic model that can predict 5 out of 2000 cases from a random affected sample, it can be estimated that ∼2,500 cases could be highly confidently predicted as disease positive (1% of 100 million gives 1 million cases, and 5/2000 = 0.25% of these is ∼2,500). Further, since these subsets contain profiles which are highly genetically differentiated, we hypothesise that these cases represent particular subtype(s) of disease which appear to have either a greater genetic basis or one that is better captured by common SNPs than the remaining cases.

6.5. Conclusions

Our analysis of eight human diseases has produced a spectrum of models with different predictive abilities: models with high predictive ability (T1D and celiac disease), medium (RA and Crohn’s disease), and low (the rest), with substantially differing numbers of SNPs in the models, ranging from < 100 SNPs for T1D and celiac disease to > 2000 SNPs for T2D, CAD, and BD. These results, together with the case substructure in T1D and celiac disease, suggest the genetic architecture of complex disease may be more heterogeneous than previously thought, both between diseases and within the same disease. More work is needed to better characterise these subtypes and to develop genetic models that can better predict the remaining cases. It remains to be seen whether genotypic risk prediction in such subpopulations will have increased predictive power over traditional risk factors such as family history. Based on AUC and estimates of heritability, our models of celiac disease and T1D explained substantial proportions of the phenotypic variance. Importantly, the results for celiac disease and T1D indicate that models which explain only part of the phenotypic variance can still have substantial predictive value. Taken together, our results indicate that the amount of missing variation in human complex disease, either phenotypic or genetic, may currently be overestimated, and that supervised learning approaches can be used to address this issue.
With the more complete profiles of genomic variation being generated by high-throughput sequencing, genetic models of human disease will only increase in predictive power, bringing the promise of genomics closer to clinical application.


From a computational and statistical perspective, this work demonstrates the utility of multivariable statistical models in genetic risk prediction, fitted to all SNPs rather than just a list of pre-filtered candidates. Specifically, we have shown that lasso-penalised models fitted to all SNPs are both practical and useful in the genetic setting, and that they achieve predictive ability equivalent to or better than models built on top of pre-filtered candidates.


7. Characterising the Genetic Control of Human Metabolic Genes

In previous chapters we explored the use of gene expression data for prediction of breast cancer metastasis, and of genetic data for predicting complex disease status. However, considering the gene expression and genetic aspects in isolation provides only narrow insight into the underlying biological mechanisms of disease. An integrated analysis of several data types can potentially provide a better understanding of the hidden relationships between cellular components and biological processes, and of the links between these processes and the observed clinical phenotypes. In this chapter, we perform an integrative analysis of SNPs, gene expression, and metabonomic data from a human population cohort, based on the sparse linear models discussed earlier.

7.1. Introduction

The availability of gene expression, genetic, and metabonomic datasets has allowed integrated analyses of the relationships between genetic variation and gene expression (Mackay et al., 2009; Stranger et al., 2007) and between genetic variation and metabolites such as blood lipids (Ferrara et al., 2008; Inouye et al., 2010a,b; Surakka et al., 2011; Teslovich et al., 2010; Tukiainen et al., 2011). These datasets can be further integrated with high-level clinical phenotypes such as occurrence of disease (Holmes et al., 2008; Nicholson and Lindon, 2008). While such studies have produced valuable insights into the regulatory mechanisms of gene expression and metabolism, many of these analyses have mainly been concerned with detecting expression quantitative trait loci (eQTL) of genes (discussed in Section 2.4.8) or QTL of metabolites, but not with linking all three aspects together into putative pathways.


Inouye et al. (2010a) did perform one such integrated analysis for a specific set of genes, considering the lipid leukocyte (LL) gene module, and inferring the effects of SNPs on module expression and its interactions with serum metabolite levels. The LL module consisted of 11 genes and was derived from clustering of gene expression for the top 10% of genes (3520 genes) associated with seven lipid measurements. Next, cis- and trans-QTLs associated with this module were detected. Finally, structural equation modelling was used to infer causal networks of lipids and LL genes. Later, Inouye et al. (2010b) expanded this analysis to all metabolites in the data, identifying associations of the LL module with levels of lipoproteins, lipids, glycoproteins, isoleucine, and 3-hydroxybutyrate, and inferring the causal structure of these associations. These findings further highlighted the relationship between mechanisms of immunity and metabolism, suggesting causal mechanisms for this link. In this chapter, we expand the analysis of the DILGOM dataset, employing lasso linear models to detect gene–metabolite and gene–SNP associations in the data, covering all assayed metabolites (136 metabolites), genes (35,419 genes), and autosomal SNPs (544,538 SNPs).

As with SNP analyses of case/control phenotypes, many studies utilise the univariable approach, where each SNP is individually tested for association with each gene expression level, using a statistical hypothesis test to assess statistical significance, and possibly applying a multiple testing correction in order to control the false positive rate. We have shown in Chapter 6 that the univariable testing approach can result in more false positive associations than multivariable models that consider all SNPs together. In contrast with the univariable hypothesis-testing paradigm, we approach the problem from a predictive perspective, using lasso-penalised multivariable models for the two tasks of (i) feature selection — identifying gene expression associated with metabolite levels, or SNPs associated with gene expression levels — and (ii) modelling of the effects themselves. Rather than finding predictors that are statistically significant, we focus on finding subsets that are highly predictive, in the sense of explaining the most variation in the phenotype. Our aim here is to perform an unbiased analysis of all assayed genes and SNPs, rather than candidate sets, with the aim of deriving new associations between SNPs, genes, and metabolites.
Our sub-goals include:

• to estimate how much of the variation in metabolite levels can be explained by genes, and in turn, how much of the variation in gene expression can be explained by SNPs;

• to compare the degree of genetic control of genes associated with the metabolome with that of non-associated genes;

• to generate plausible hypotheses of the mechanisms of genetic regulation of metabolite levels, as mediated by gene expression levels;


• to link the genetic, gene expression, and metabolic data back to clinically relevant phenotypes related to disease.

We make several simplifying assumptions in our analyses. First, we assume that genetic factors (SNPs) are always “upstream” of gene expression and metabolites, in the sense that SNPs can be causal factors of gene expression and metabolites, but not the other way around. This assumption is based on the current understanding of molecular biology, in that SNPs are largely invariant throughout the life of an organism, and it underlies genome-wide association studies (GWAS) and Mendelian genetics in general. This causal effect of a SNP may not be direct but can be mediated by other factors, unless the SNP is a cis-QTL, in which case we assume that the effect is direct. In contrast, the question of causality in associations between gene expression and metabolite levels is less clear, as both factors can potentially be causal to each other. However, we make the simplifying assumption that the genes mediate between SNPs and metabolite levels, an assumption likely to hold for at least some genes, as metabolites are not coded for in the DNA and therefore SNP effects on them, if they exist, must be indirectly mediated through gene expression. While feedback loops (metabolites affecting genes) are known to occur in regulatory and metabolic networks (Alon, 2007), these are more difficult to model, and require time series data, RNA knockdown, or gene knockout data in order to infer the causal structure.

Our analytical pipeline is outlined in Figure 7.1. Briefly, we begin by lasso regression of each metabolite on all genes and all clinical variables. We then select the metabolites that were predicted with the highest R². For these metabolites, we then keep the genes selected in each model. In turn, we regress each gene on age and gender using unpenalised linear regression, and keep the residuals from each gene. Next, we regress the residuals for each gene on all SNPs, to detect QTLs affecting gene expression. Finally, we combine the results in a network analysis to derive hypotheses of genetic regulation of metabolites, focusing on metabolites associated with fasting glucose levels, a key determinant of type 2 diabetes.

7.2. Methods

Here we describe the dataset used in this study and the lasso-penalised models used to detect the associations.

7.2.1. Data

The Dietary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome (DILGOM) dataset contains 509 human samples in total (234 males, 275 females) in common between the gene expression, SNP, and metabolite data. The individuals were randomly sampled from the Finnish population (Inouye et al., 2010b). The data include:


[Figure 7.1: flowchart of the analysis pipeline, with metabolites, genes, clinical variables, and SNPs as inputs: remove clinical effects from metabolites; regress each metabolite on all genes; retain the predictable metabolites and the stable genes; remove the effects of gender and age from the genes; regress each gene on all SNPs; retain the predictable genes and the stable SNPs; infer causal networks.]

Figure 7.1.: Schematic diagram of our analysis pipeline.


• Concentrations of 136 serum metabolites and derived quantities, such as lipids, amino acids, and glucose levels, measured using 1H nuclear magnetic resonance (NMR) spectroscopy (Ala-Korpela, 2008). The derived quantities are not metabolites per se, but ratios of other metabolites that have previously been found to be biologically meaningful. For convenience, we refer to all the measurements as “metabolites”. Missing values were imputed to the median for each metabolite. Zero values were set to the smallest non-zero value for each metabolite, to prevent missing values later. The metabolites were transformed using the Box-Cox transformation to achieve approximate normality and homoscedasticity (the transformation parameter λ was automatically determined using the Guerrero method (Guerrero, 1993) in BoxCox.lambda in the R package forecast). Note that while the Box-Cox transformation can make the data more approximately normally distributed, thus reducing the effects of outliers and stabilising the variance, it changes the scale of the data, so that the transformed data are not on the original measurement scale.

• Whole blood gene expression of over 35,419 probes, from the Illumina HT-12 Expression BeadChip, on the log scale. The data were normalised using quantile normalisation (Bolstad et al., 2003).

• 544,538 autosomal SNPs, using the Illumina 610-quad SNP array. The quality control procedure is detailed in Inouye et al. (2010b). Missing values were randomly imputed within the SparSNP analysis.

• 21 clinical variables, including gender, age, BMI, weight, height, waist circumference, CRP (C-reactive protein) levels, insulin levels, cholesterol- and hypertension-lowering medication, smoking, alcohol intake, blood pressure, and others. Missing values were imputed to the median for each variable.

7.2.2. Predictive Modelling

We employed linear models to model metabolites as functions of gene expression, and gene expression as functions of genetic variation. The linear model is formulated as

    y_i = \beta_0 + x_i^T \beta + \epsilon_i, \quad i = 1, \ldots, N,    (7.1)

where y_i \in \mathbb{R} and x_i \in \mathbb{R}^p are the ith output and input, respectively, \beta_0 \in \mathbb{R} and \beta \in \mathbb{R}^p are the intercept and the weights, respectively, \epsilon_i \sim N(0, \sigma^2) is iid Gaussian noise, N is the number of samples, and p is the number of model weights. For unpenalised models, the model is fitted using maximum likelihood (least squares). For the lasso-penalised models, the model is fitted using penalised maximum likelihood, as implemented in the R package glmnet and the tool SparSNP (discussed in Chapter 5).
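For one metabolite, the preprocessing described above followed by a lasso fit of model (7.1) on the gene expression matrix can be sketched in R as below. This is a rough illustration, not the analysis code; `metab` (a metabolite vector) and `expr` (an N × p expression matrix) are hypothetical inputs.

```r
## Preprocess one metabolite and regress it on all gene expression probes.
library(forecast)   # BoxCox.lambda() and BoxCox()
library(glmnet)

metab[is.na(metab)] <- median(metab, na.rm = TRUE)       # impute to the median
metab[metab == 0]  <- min(metab[metab > 0])              # replace zero values
lambda.bc <- BoxCox.lambda(metab, method = "guerrero")   # Guerrero (1993)
y <- BoxCox(metab, lambda.bc)                            # transformed metabolite

cv <- cv.glmnet(expr, y, family = "gaussian")            # penalty chosen by CV
coef(cv, s = "lambda.min")                               # genes with non-zero weights
```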


Model Evaluation

We measure the predictive ability of a given model using the R², defined as

    R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2},    (7.2)

where y_i, \hat{y}_i, and \bar{y} are the ith output, the ith predicted output, and the average output \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i, respectively. Note that under this definition R² ∈ (−∞, 1]. A negative R² indicates that the model is worse than the model that includes the intercept only, usually caused by overfitting. R² is a function of the mean squared error (MSE), but unlike the MSE it is comparable across outputs with different variance. Although R² is not strictly a proportion, we can take R² to be the explained proportion of the variance by truncating negative R² to zero, as a negative R² means that the model is not explaining any variation.

Models of the Metabolites

We considered two model classes for the metabolites, using different inputs:

1. Gene expression together with clinical variables
2. Gene expression after clinical variable effects have been removed

The clinical variables were included in order to minimise potential confounding caused by differences in gene expression between groups such as different genders and different ages, and to assess how much of the remaining variation in metabolite levels can be explained by gene expression, above the variation explained by other clinically important variables such as insulin levels and CRP (C-reactive protein) levels. These models are not equivalent due to the lasso penalisation: in model 1, both genes and clinical variables can be selected with non-zero weights, and the resulting model will potentially contain both. In contrast, in model 2, we first remove the additive component of the effect of the clinical variables using unpenalised linear regression, and then model the residual variation in the metabolite using a lasso-penalised model of gene expression.

Thus, model 1 is useful for assessing the relative importance of each variable for predicting the metabolite, but the R² is due to both clinical variables and genes, and therefore does not reflect how much of the variation is due to each group of variables alone. The aim of model 2 is to estimate the partial contribution of gene expression to predicting the metabolite, while still accounting for the effects of the clinical variables. The R² for model 2 is equivalent to the semi-partial R² (Cohen et al., 2003), which can be interpreted informally as the explained proportion of variance remaining after removing the metabolite variation attributable to the clinical variables.

In evaluating the models’ predictive ability, there is the possibility of overfitting, which is the situation where the predictive ability in the training data is substantially higher than in the test data, essentially due to the model fitting the noise rather than the signal. We reduce the degree of overfitting by penalisation, chosen by cross-validation, and by using independent testing data for estimating the R².
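Equation (7.2), with the truncation to zero used when interpreting R² as a proportion of variance explained, can be written as a small R helper (an illustrative sketch only):

```r
## R^2 as in equation (7.2); negative values are truncated to zero when
## interpreting R^2 as the explained proportion of variance.
r.squared <- function(y, yhat, truncate = TRUE) {
  r2 <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
  if (truncate) max(r2, 0) else r2
}
```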


To estimate the R² for models 1 and 2 we used nested 10 × 10 × 10-fold cross-validation (Ambroise and McLachlan, 2002): split the data into 10 folds, run 10-fold cross-validation in each of the 10 training folds to select the penalty (from the model with the highest R²), train a model on the entire training data using the chosen penalty, and test this model on the unseen test fold. This process is repeated 10 times. The final R² values are averaged over the test set R² in the 10 replications. For model 2, we used nested cross-validation but with a two-stage model, first removing the effect of the clinical variables, as follows:

1. Split the data into training and testing folds.
2. In the training data, we regressed each metabolite on the clinical variables using unpenalised linear regression.
3. We ran the lasso on the training data within 10-fold cross-validation, over a grid of penalties λ_max, ..., λ_min, using the gene expression data as input and the residuals from the clinical variables as output. The best model was chosen as the model with the highest R².
4. We applied the model of clinical variables to the 10% independent test data to derive a prediction of the metabolite levels, and the residuals for the test data.
5. We ran the lasso on the entire 90% training set using the optimal penalty λ*, and tested the model on the residuals of the 10% test data.
6. Repeated steps 1–5 ten times.
7. The final R² values were averaged over the test set R² in the multiple replications.

Models of Gene Expression

We used lasso-penalised linear models in the tool SparSNP (Chapter 5) to model gene expression as a function of the SNPs (expression QTL). To estimate the R² for regressing the genes on the SNPs, we:

1. Regressed each gene on gender and age, using an unpenalised linear model.
2. Obtained the residuals for each gene.
3. Regressed the residuals for each gene on all SNPs, using lasso-penalised linear models, repeated within 30 × 10-fold cross-validation.

We then ranked the SNPs selected by the lasso method over the cross-validation replications, based on the proportion of replications in which they were selected.
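Steps 1–3 of the gene expression modelling can be sketched in R as follows, with glmnet shown as a stand-in for SparSNP. This is illustrative only; `expr.g` (one gene's expression vector), `covar` (a data frame with gender and age), and `geno` (the N × p SNP dosage matrix) are hypothetical inputs.

```r
## Residualise a gene on gender and age, then lasso on all SNPs (eQTL model).
library(glmnet)
resid.g <- resid(lm(expr.g ~ gender + age, data = covar))   # steps 1-2
cv <- cv.glmnet(geno, resid.g, family = "gaussian")         # step 3, one replication
selected <- which(as.vector(coef(cv, s = "lambda.min"))[-1] != 0)  # SNPs with non-zero weights

## Repeating this over many cross-validation replications and tabulating how
## often each SNP is selected gives the stability ranking described above.
```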


7.2.3. Causal Network Inference

Based on the predictive genes for the metabolites and the predictive SNPs for the genes, we inferred association networks of genes and SNPs for each metabolite. We refer to the process of determining the causal direction of each edge as orienting the edge. The edges between the SNPs and the genes are directed (causal), as we assume that SNPs can affect gene expression but gene expression cannot induce genetic variation.

In contrast, the edges between the genes and the metabolites cannot be oriented from the observed associations alone — we cannot determine whether gene expression regulates metabolite levels or whether it is the other way around without external information. Such external information is provided by cis-QTLs, assumed to be causal anchors (Aten et al., 2008; Schadt et al., 2005): direct causal regulators of gene expression for the genes they affect (Veyrieras et al., 2008), which can be used to orient the edges for the rest of the graph. Based on this assumption, any observed association between a gene and its cis-QTL is the result of a direct causal effect and not due to confounding with other genes or SNPs. We also assume that each cis-QTL can directly affect only one gene. In our analysis we considered a SNP associated with a gene to be a cis-QTL if it resides within a 1Mb-wide window of the probe's centre (Stranger et al., 2005). We define a SNP associated with a gene to be a trans-QTL if it is on another chromosome or is not a cis-QTL.

There are five possible graphs describing the causal relationships between one SNP (not necessarily a cis-QTL), one gene, and one metabolite:

1. SNP → gene → metabolite
2. SNP → metabolite → gene
3. SNP → gene ← metabolite
4. SNP → metabolite ← gene
5. gene ← SNP → metabolite

We assume that all other graphs, involving an edge into the SNP, are not possible, as the SNP is always a causal factor. When a model shows that the SNP is a causal factor of the metabolite, this does not imply a direct effect; it can be an indirect effect, for example through other genes. In addition to assuming that SNPs are causal (edges out of SNPs, not into SNPs), by restricting our analysis to SNPs that are cis-QTLs we can remove models 2 and 4, since they imply an effect of a SNP on the metabolite but not on the gene, contradicting our assumption of a direct effect on the gene.

Each regulatory model is characterised by a specific pattern of marginal and conditional independences (Table 7.1). Under the assumptions of approximate normality of the variables and linearity of the effects, these independences translate to marginal correlations and partial correlations, respectively (Whittaker, 1990).


Model (1) S → G → M: no marginal independences; conditionally, S ⊥ M | G; cor(S,G) ≠ 0, cor(S,M) ≠ 0, cor(M,G) ≠ 0; pcor(S,G) ≠ 0, pcor(S,M) = 0, pcor(M,G) ≠ 0.
Model (2) S → M → G: no marginal independences; conditionally, S ⊥ G | M; cor(S,G) ≠ 0, cor(S,M) ≠ 0, cor(M,G) ≠ 0; pcor(S,G) = 0, pcor(S,M) ≠ 0, pcor(M,G) ≠ 0.
Model (3) S → G ← M: marginally, S ⊥ M; no conditional independences; cor(S,M) = 0, cor(S,G) ≠ 0, cor(M,G) ≠ 0; pcor(S,M) ≠ 0, pcor(S,G) ≠ 0, pcor(M,G) ≠ 0.
Model (4) S → M ← G: marginally, S ⊥ G; no conditional independences; cor(S,G) = 0, cor(S,M) ≠ 0, cor(M,G) ≠ 0; pcor(S,G) ≠ 0, pcor(S,M) ≠ 0, pcor(M,G) ≠ 0.
Model (5) G ← S → M: no marginal independences; conditionally, G ⊥ M | S; cor(G,M) ≠ 0, cor(G,S) ≠ 0, cor(M,S) ≠ 0; pcor(G,M) = 0, pcor(G,S) ≠ 0, pcor(M,S) ≠ 0.

Table 7.1.: The marginal and conditional independence statements that can be derived from the (SNP, gene, metabolite) graph, and the corresponding correlations and partial correlations.

Therefore, based on the pattern of marginal and partial correlations we can discriminate between the three models of regulation (Figure 7.2). We estimated the 3 × 3 partial correlation matrix Π of each (SNP, gene, metabolite) triplet by the Moore-Penrose pseudoinverse of the correlation matrix Σ (obtained from the singular value decomposition of Σ), denoted Σ^†, and then normalised the entries such that

    \Pi_{ij} = \begin{cases} \dfrac{-\Sigma^{\dagger}_{ij}}{\sqrt{\Sigma^{\dagger}_{ii}\,\Sigma^{\dagger}_{jj}}} & \text{if } i \neq j \\ 1 & \text{otherwise.} \end{cases}    (7.3)

Since the correlations and partial correlations are observed in noisy data, we must use a measure of statistical significance to decide whether they are significantly different from zero. We used Fisher's z-transformation (Fisher, 1921, 1924) to transform the correlation (or partial correlation) r to achieve approximate normality,

    F(r) = \frac{1}{2}\sqrt{N - 3}\,\log\left(\frac{1 + r}{1 - r}\right),    (7.4)

which is approximately standard normal under the null hypothesis of zero correlation.
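As a concrete illustration, the partial correlation matrix of Eq. 7.3 and the Fisher test of Eq. 7.4 can be computed with NumPy and SciPy as in the minimal sketch below (not the thesis code); the triplet is assumed to be an N × 3 array with columns (SNP, gene, metabolite).

    import numpy as np
    from scipy import stats

    def partial_correlations(X):
        """X is an N x 3 array with columns (SNP, gene, metabolite)."""
        sigma = np.corrcoef(X, rowvar=False)   # 3 x 3 correlation matrix
        theta = np.linalg.pinv(sigma)          # Moore-Penrose pseudoinverse (computed via SVD)
        d = np.sqrt(np.outer(np.diag(theta), np.diag(theta)))
        pcor = -theta / d                      # Pi_ij = -Theta_ij / sqrt(Theta_ii * Theta_jj)
        np.fill_diagonal(pcor, 1.0)            # 1 on the diagonal, as in Eq. 7.3
        return pcor

    def fisher_z_pvalue(r, n):
        """Two-sided p-value for H0: (partial) correlation = 0, via Eq. 7.4."""
        z = 0.5 * np.sqrt(n - 3) * np.log((1 + r) / (1 - r))
        return 2 * stats.norm.sf(abs(z))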


[Figure 7.2: decision procedure for discriminating between the regulatory models of a (cis-QTL S, gene G, metabolite M) triplet. If pcor(S,M) = 0, choose model (1) S → G → M; otherwise, if pcor(G,M) = 0, choose model (5) G ← S → M; otherwise, choose model (3) S → G ← M.]
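Written as code, the decision rule of Figure 7.2 is a small function. The sketch below is hypothetical: the threshold alpha and the p-values from the Fisher-z tests of the partial correlations are assumptions, and treating a non-significant partial correlation as zero is a simplification of the testing described above.

    def orient_triplet(pval_pcor_SM, pval_pcor_GM, alpha=0.05):
        if pval_pcor_SM > alpha:      # pcor(S, M) indistinguishable from zero
            return "model 1: S -> G -> M"
        if pval_pcor_GM > alpha:      # pcor(G, M) indistinguishable from zero
            return "model 5: G <- S -> M"
        return "model 3: S -> G <- M"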


7.3. Results

Figure 7.3.: R^2 for regressing the metabolites on all gene probes, together with all clinical variables (model 1), or after removing the effect of the clinical variables (model 2), showing the top 10 for each model. The results for all metabolites are shown in the insets. Metabolites were sorted in descending order of R^2. R^2 was estimated with nested cross-validation. Note the different scales.

For model 2, the R^2 values were lower, as the effect of the clinical variables has been removed: out of the 136 metabolites, 11 had R^2 ≥ 0.15 and 22 had R^2 ≥ 0.1. Overall, the metabolite R^2 had a Spearman rank correlation of 0.782 between model 1 and model 2. Out of the top 10 metabolites in model 1, 7 were retained in the top 10 metabolites for model 2.

In order to find which variables were selected in each metabolite model, we tabulated the variables (gene expression probes or clinical variables) selected by the lasso model for each metabolite. Since the cross-validation replications are split randomly, a gene may be selected in one replication but not in another. Therefore, we kept only the stable markers — those that were consistently included in the model in ≥ 50% of the cross-validation replications. There were a total of 1137 and 504 unique markers stably selected for models 1 and 2, respectively; however, some probes recurred across many models more than others. For example, HDC (histidine decarboxylase) was the most commonly selected gene, stably selected for 72 out of the 136 metabolites (53%) in model 1, and for 61 (45%) of the metabolites in model 2. For model 1, the most commonly selected marker was waist circumference (Waist circum), selected in 86 (63%) of the metabolites. The top ranked markers are shown in Figure 7.4, using two rankings: (i) the proportion of metabolites for which the marker was selected, and (ii) the proportion weighted by the R^2, in order to upweight markers associated with metabolites that can be better predicted over markers that are common but are associated with metabolites with low R^2.
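A minimal sketch of this tabulation (hypothetical data structures, not the thesis code): markers are counted across cross-validation replications, filtered at the 50% stability threshold, and then ranked across metabolites by selection proportion and by the R^2-weighted proportion used in Figure 7.4.

    from collections import Counter

    def stable_markers(selected_per_replication, threshold=0.5):
        """selected_per_replication: list of sets of marker names, one set per replication."""
        n = len(selected_per_replication)
        counts = Counter(m for sel in selected_per_replication for m in sel)
        return {m for m, c in counts.items() if c / n >= threshold}

    def rank_markers(stable_by_metabolite, r2_by_metabolite):
        """Rank markers by the proportion of metabolite models they appear in,
        and by that proportion weighted by each metabolite's R^2."""
        n_models = len(stable_by_metabolite)
        prop, weighted = Counter(), Counter()
        for metab, markers in stable_by_metabolite.items():
            for m in markers:
                prop[m] += 1.0 / n_models
                weighted[m] += r2_by_metabolite[metab] / n_models
        return prop.most_common(), weighted.most_common()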


Figure 7.4.: The top 10 variables (clinical variables and genes for model 1, genes only for model 2) selected as predictors of metabolite variation in models 1 and 2. The genes were ranked by the proportion of metabolites for which each gene was selected by the lasso regression (a, b), or by the proportion times the R^2 for the corresponding metabolite (c, d), in order to upweight genes that are not only included as predictors of many metabolites but are also more highly predictive. Each inset shows all the variables for each model. Note the different scales.


For both model 1 and model 2, 9 of the 10 top markers ranked by proportion remained in the top 10 when ranked by the weighted score, indicating both that these markers are shared by a substantial proportion of the metabolite models and that these metabolite models are the ones with highest R^2. Between models 1 and 2, 6 and 7 of the top 10 markers were shared in the proportional and weighted rankings, respectively.

The Relative Contribution of Genomic and Clinical Factors to Metabolite Variation

We sought to assess how much of the explained variation in each metabolite was due to clinical factors, such as age, gender, and BMI, and how much was attributable to gene expression, under an assumption of additive effects. If the gene expression can explain further variation in each metabolite, above what was explained by the clinical factors, this indicates that the gene expression is capturing some aspect of the metabolite not explained by traditional clinical indicators.

To assess whether the ability to predict each metabolite was largely attributable to clinical factors or to gene expression, we compared the R^2 for each metabolite between models 1 and 2, ranking them by the ratio of R^2 from model 2 to R^2 from model 1, R^2_{model 2} / R^2_{model 1}, after removing metabolites with zero or negative R^2 in model 1, as these indicate that the model has no explanatory power (we removed 38 such metabolites). Metabolites with ratios > 0.5 indicate that the contribution of gene expression to the predictive ability of the model is higher than the contribution of the clinical variables, and the converse holds for ratios < 0.5. Ratios of zero indicate no gene contribution at all, as all the variance that could be explained has been explained by the clinical variables and adding the gene expression does not improve the model.

Of the remaining 98 metabolite measurements, 60 had non-zero ratios (Figure 7.5), indicating some genomic contribution to the model predictiveness. Of the 60, nine showed R^2 ratios ≥ 0.5: lactate, 3-hydroxybutyrate, acetoacetate, CH2 groups of mobile lipids, total fatty acids (TotFA), mean diameter of very low density lipoprotein particles (VLDL D), cholesterol esters in medium VLDL (M VLDL CE), serum triglycerides (Serum TG), and total cholesterol in medium VLDL (M VLDL C). Lactate was the only metabolite that had a slightly better R^2 in model 2 than in model 1 (0.115 versus 0.107, respectively); however, the difference may be due to randomisation in the cross-validation procedure. Thus, for these nine metabolites, we estimate that the majority of the explained phenotypic variance is attributable to gene expression rather than to clinical factors. While the estimates of R^2 are averaged over multiple nested cross-validation replications and are thus less likely to be spurious than estimates derived without cross-validation, it may be useful to explore whether there is a multiple testing issue due to considering many metabolites.
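A small sketch of this ranking (hypothetical dictionaries mapping metabolite names to cross-validated R^2, not the thesis code), including the truncation of negative R^2 to zero described in the Methods:

    def r2_ratios(r2_model1, r2_model2):
        # Drop metabolites with no explanatory power in model 1, truncate negative
        # model-2 R^2 to zero, and rank by the model-2 to model-1 ratio.
        ratios = {m: max(r2_model2[m], 0.0) / r2
                  for m, r2 in r2_model1.items() if r2 > 0}
        return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)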


Figure 7.5.: Ratio of R^2 in model 2 to R^2 in model 1 for each metabolite, sorted in decreasing order. Large figure: metabolites with ratios ≥ 0.5. Inset: all 98 metabolites with positive R^2 in model 1.


Partitioning the Metabolites into Subtypes

Many of the metabolites analysed here are known to be functionally related to each other and can be partitioned into subtypes. We used hierarchical clustering with complete linkage, and defined eight clusters (Figure 7.6), based on visual inspection of the plot. The clusters showed substantial differences in terms of R^2 for model 2 (Figure 7.7), with clusters 1, 2, and 7 having the highest median R^2. Clusters 1 and 2 consist mainly of measurements related to VLDL, and cluster 7 is a small cluster consisting of citrate, glycerol, 3-hydroxybutyrate, and acetoacetate. In cluster 5, lactate was the only metabolite with positive R^2 (0.115), whereas the remaining members of its group could not be predicted from the genes.

For each metabolic cluster, we tabulated the predictive genes indicated by the lasso model over the cross-validation replications. Genes were selected as stable if they were in the model in at least 60% of the cross-validation replications. The stable genes selected for each metabolite cluster are shown in Table 7.2. HDC (histidine decarboxylase), in clusters 1, 2, and 6, encodes an enzyme involved in the synthesis of histamine, which in turn has wide-ranging effects across the digestive, neural, and immune systems (Schneider et al., 2002). Genetic variation in FCER1A (Fc fragment of IgE, high affinity I, receptor for; alpha polypeptide), from cluster 1, has been shown to be associated with serum immunoglobulin E (IgE), a marker for allergic disorders and parasitic exposure (Weidinger et al., 2008). ABCA1 (ATP-binding cassette, sub-family A (ABC1), member 1), in cluster 1, is known to be responsible for cholesterol transport from peripheral cells back to the liver (Oram and Lawn, 2001), thereby reducing cholesterol levels in the body, and may also contribute to the processes of atherosclerosis, apoptosis, and inflammation (Soumian et al., 2005). MS4A3 (membrane-spanning 4-domains, subfamily A, member 3 (hematopoietic cell-specific)), in clusters 1 and 2, is responsible for cell signalling in the cell cycle mechanism of hematopoietic cells (Kutok et al., 2011). SLC25A20 (solute carrier family 25 (carnitine/acylcarnitine translocase), member 20), in cluster 7, encodes an enzyme involved in transporting long-chain fatty acids into the mitochondria (Iacobazzi et al., 2004). CPT1A (carnitine palmitoyltransferase 1A (liver)), in cluster 7, encodes an enzyme that initiates the oxidation of long-chain fatty acids in the mitochondria (Bonnefont et al., 2004). Currently, there is no known function for SNORD13 (small nucleolar RNA, C/D box 13), from clusters 1 and 2, or for C21orf7 in cluster 1.

7.3.2. Integrating the Metabolite Models with Models of Gene Expression based on SNPs

Having developed predictive models of metabolites from gene expression, we turned to modelling gene expression itself using genetic variation in the form of SNPs, to assess whether any of the genes that were predictive of metabolite levels were themselves under genetic control.

Detecting Genetically Regulated Genes

For this analysis, we selected genes based on their contribution to the metabolite models.
We ranked the metabolite models in descending order of R^2, and used a cutoff of R^2 = 0.1 to select the gene probes belonging to these metabolite models, leaving 44 unique probes, which we subsequently used as candidates in the eQTL analysis.


Figure 7.6.: Hierarchical clustering of the metabolites, using complete linkage.


Figure 7.7.: Box-and-whisker plots of R^2 in model 2 for each metabolite cluster, predicting metabolite concentrations from gene expression.

Cluster   Stable genes
1         SNORD13, HDC, C21orf7, ABCA1, FCER1A, MS4A3
2         HDC, SNORD13, MS4A3
3         -
4         -
5         -
6         HDC
7         SLC25A20, CPT1A
8         -

Table 7.2.: The stable predictive genes selected for each metabolic cluster (appeared in the lasso model in ≥ 60% of the cross-validation replications). "-" indicates that no genes were stably selected in this cluster.


We used lasso linear regression to separately regress each of the selected genes on all SNPs, after having removed the effect of age and gender from the gene expression using unpenalised linear regression, as these factors are known to affect gene expression and may confound the analysis. For each gene, we performed 30 × 10-fold cross-validation, estimated the cross-validated R^2 for each model across a range of model sizes, and selected the model with the best cross-validated R^2.

Figure 7.8 shows the R^2 for predicting the gene expression from the SNPs, for each metabolite where the metabolite-to-gene association was stable, as discussed in Section 7.3.1. Also shown are the R^2 aggregated over all metabolites ("All") and over a set of random genes. For the random genes, we randomly selected 2000 genes not in the metabolic list, and removed genes having correlation R ≥ 0.5 with any gene stably selected for the metabolites, leaving 1771 random genes. The random set of genes was used as a negative control. The R^2 are highly skewed — most genes cannot be predicted from the SNPs; however, a small minority have substantial R^2. Many metabolites are associated with the gene TFG, which is highly predictable from the SNPs (R^2 = 0.741). However, there was no evidence for systematically different R^2 between the two sets of genes, metabolic versus random (p = 0.613 from a two-sided Kolmogorov-Smirnov test that the two R^2 samples come from the same distribution). The four genes for isoleucine had the highest median R^2, with R^2 = 0.495 and R^2 = 0.105 for ANKRA2 and TMEM140, respectively, and R^2 = 0 for the two other genes associated with it (MS4A3 and HDC).

Edge Orientation and Hypotheses of Regulation

For each metabolite, we used the stable predictive genes and their stable cis-QTLs to infer causal networks using partial correlation. The inferred causal network for serum triglycerides (Serum TG) is shown in Figure 7.9. Using three cis-QTLs for TFG (rs13059686, rs591728, rs544500) as causal anchors, we oriented TFG as causal of serum triglyceride levels. In contrast, while ZYG11B is stably associated with serum triglycerides and has several associated QTLs, none of the SNPs were found to be cis-QTLs, and the orientation of the edge between the gene and serum triglycerides remains ambiguous.

7.3.3. Linking the Causal Networks to Fasting Glucose Levels and Type 2 Diabetes

Levels of blood metabolites, particularly of the amino acids isoleucine, leucine, valine, tyrosine, and phenylalanine, have previously been associated with risk of type 2 diabetes (Wang et al., 2011). Our data does not include type 2 diabetes (T2D) status; however, it does include fasting glucose (FG) levels. FG levels in the blood are commonly used to test for T2D status, with FG levels of 3.6–6.0 mmol/L considered healthy (n = 331 in the data), 6.1–6.9 mmol/L indicating pre-diabetes (n = 148), and 7.0 mmol/L and higher indicating T2D (n = 30).


Figure 7.8.: Box-and-whisker plots of cross-validated R^2 for the stable genes associated with each metabolite (predicted from the SNPs), compared with an aggregation of all metabolites ("All") and a random set of genes ("Random"). Also shown is the number of stable genes associated with each metabolite.


Figure 7.9.: Inferred network of regulation for serum triglycerides. Inferred causal edges are shown as solid edges. Dashed edges represent trans-QTLs, where a direct causal effect on the gene cannot be inferred. The edge widths are proportional to the R^2 of the marginal association between the nodes from a univariable linear regression (shown in parentheses).


Glucose was also included as one of the metabolites measured by NMR, and was highly correlated with the clinical FG levels (r = 0.93). However, NMR glucose was not predictable from the genes nor from the SNPs using lasso linear models (negative R^2 in both cases, results not shown). The lack of predictive ability from genes or SNPs also means that we cannot infer causal networks for FG, as we did for the other metabolites. The DILGOM study is likely underpowered to detect signals for fasting glucose, and current GWAS of fasting glucose have sample sizes in the tens of thousands (Dupuis et al., 2010). Instead, we investigated associations between FG and the metabolites, including the amino acids indicated by Wang et al. (2011). We used two lasso-penalised models: a linear model in the log concentration,

    \log FG_i = \sum_{j=1}^{p} x_{ij} \beta_j + \epsilon_i,    (7.6)

where \epsilon_i \sim N(0, \sigma^2) is iid random noise, and a logistic model for FG ≥ 7 mmol/L (linear in the log odds),

    \mathrm{logit}\{\Pr(FG_i \geq 7)\} = \sum_{j=1}^{p} x_{ij} \beta_j,    (7.7)

where logit(r) = log(r/(1 − r)). In both models, we included all metabolites and all clinical variables in order to minimise confounding of the association between the metabolites of interest and fasting glucose levels. For both models, we used lasso penalisation to select the relevant variables, within 50 × 5 × 10-fold nested cross-validation. We considered metabolites that were selected with non-zero weights in more than 70% of the replications as "stable". Note that there is no intercept in the models, as the metabolite concentrations have been regressed on the clinical variables together with an intercept term.

All inputs were scaled to zero mean and unit variance; therefore, the regression weights should be interpreted in units of standard deviations of each metabolite. For example, a weight of exp(β) = 1.017 in the linear model (7.6) means that a one standard deviation increase in the metabolite is associated with a multiplicative increase by a factor of 1.017 in fasting glucose, such as the increase from 6.0 mmol/L to 6.102 mmol/L. Weights less than one should be interpreted as negative associations: an increase in metabolite levels decreases fasting glucose levels in the linear model (Eqn. 7.6). Similarly, for the logistic model, the weight exp(β) corresponds to the increase in the odds ratio for each standard deviation increase in the input variable.

Figure 7.10 shows the metabolites stably associated with fasting glucose, in the linear and logistic models, correcting for the effects of clinical variables such as age, gender, insulin levels (known to be highly associated with fasting glucose levels), cholesterol-lowering medication, and blood pressure. The appearance of amino acids such as isoleucine, tyrosine, valine, and alanine in the models of FG is consistent with the findings of Wang et al. (2011) associating these amino acids with increased T2D risk.
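As a rough illustration (not the analysis pipeline used here, which involved repeated nested cross-validation and stability filtering), the two models of Eqns. 7.6 and 7.7 can be fitted with standard penalised-regression software; the sketch below assumes scikit-learn, a standardised input matrix X with the clinical effects already removed, and a fasting-glucose vector fg in mmol/L.

    import numpy as np
    from sklearn.linear_model import LassoCV, LogisticRegressionCV

    def fit_fg_models(X, fg, threshold=7.0):
        # Linear lasso model of log fasting glucose (Eqn. 7.6)
        linear = LassoCV(cv=10, fit_intercept=False).fit(X, np.log(fg))
        # Logistic lasso model of FG >= 7 mmol/L (Eqn. 7.7)
        logistic = LogisticRegressionCV(cv=10, penalty="l1", solver="liblinear",
                                        fit_intercept=False).fit(X, fg >= threshold)
        # exp(beta): multiplicative change in FG (linear model) or odds ratio
        # (logistic model) per one standard deviation increase in each input
        return np.exp(linear.coef_), np.exp(logistic.coef_.ravel())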


Figure 7.10.: Metabolites selected as stably associated with fasting glucose by the lasso regression, correcting for the effect of the clinical variables. Panel (a): linear model of log FG; panel (b): logistic model of FG ≥ 7 mmol/L. The edge weights show the exponentiated weights exp(β), corresponding to increases in (a) fasting glucose and (b) the odds ratio of fasting glucose ≥ 7 mmol/L, respectively, for a one standard deviation increase in each metabolite, averaged over the cross-validation replications.


The metabolites selected in the two models largely differ, with the exception of lactate, which appears in both. We hypothesise that this difference may be due to the fact that the linear model attempts to model variation across all levels of fasting glucose, whereas the logistic model only attempts to discriminate low from high levels. There may be non-linear regulatory effects that occur when fasting glucose increases from medium to high levels, but that do not manifest when fasting glucose levels are low. This hypothesis is supported by the findings of a large GWAS of T2D risk and fasting glucose by Dupuis et al. (2010), who report that not all of the SNPs associated with fasting glucose across the physiological range were also associated with pathological levels of fasting glucose and high T2D risk, potentially indicating that fasting glucose levels themselves are not necessarily indicative of high risk, but rather that the mechanisms by which fasting glucose levels are raised may differ between the two regimes. Since we are investigating associations of metabolites with fasting glucose, we cannot orient the edges between the metabolites and fasting glucose, as the cis-QTLs can no longer be used as causal anchors here — we do not have a known direct causal driver of either the metabolites or of fasting glucose.

For some of the metabolites we could infer the causal networks that affect them, based on finding cis-eQTLs as causal anchors for the genes. Figure 7.11 shows the inferred causal networks for the metabolites lactate, triglycerides in medium VLDL (M VLDL TG), and isoleucine. For lactate, only one stable and causal probe was found (ILMN 1867138, from the RST13952 Athersys RAGE Library), mapping to a region on chr4 with currently no known gene annotated, and a corresponding cis-QTL rs6852748. For M VLDL TG, causal anchor cis-QTLs were found for all four genes. PSMD2 (proteasome (prosome, macropain) 26S subunit, non-ATPase, 2) encodes a subunit of an enzyme (the proteasome) which is involved in the production of major histocompatibility complex class 1 (MHC-1) peptides (Kloetzel, 2004), an important step in the cellular immune response. TFG (TRK-fused gene) is involved in the regulation of protein export from the endoplasmic reticulum and has a role in oncogenesis (Hernández et al., 1999; Miranda et al., 2006; Pagant and Miller, 2011). CCS (copper chaperone for superoxide dismutase) delivers copper to the enzyme copper/zinc superoxide dismutase; copper deficiency has been associated with increased risk of cardiovascular disease and diabetes, among others, potentially through the effects of increased oxidative stress (Uriu-Adams and Keen, 2005).
DGKQ (diacylglycerol kinase, theta 110kDa) encodes an intracellular lipid kinase responsible for regulating levels of diacylglycerol in the cell membrane, and its products are involved in the control of diverse processes such as lipid metabolism, cell growth, membrane trafficking, cell differentiation, and cell migration (Mérida et al., 2008). ANKRA2 (ankyrin repeat, family A (RFXANK-like), 2) physically binds to class II histone deacetylases, enzymes that reduce gene transcription by deacetylating the amino-terminal tails of histones, and may be involved in signalling pathways controlling antigen presentation (McKinsey et al., 2006).


Metabolite    Gene      SNP          Chr    Nearby gene (within 1Mb)
M VLDL TG     PSMD2     rs288723     13     -
              PSMD2     rs3805036    3      ITPR1
              DGKQ      rs225320     21     TMPRSS3
              DGKQ      rs2249431    20     SIRPD
              CCS       rs3817625    14     TRA/TCR/TCRVA15/TCRA
              TFG       rs660657     11     FLI1
              TFG       rs4273418    4      -
              TFG       rs2419324    13     ATP8A2
              TFG       rs2046227    3      GPR128
              TFG       rs1823870    2      LTBP1
              TFG       rs16875290   7      GARS
              TFG       rs1485003    7      PDK4
              TFG       rs12495023   3      ST6GAL1
Isoleucine    TMEM140   rs10403127   19     MED25
              TMEM140   rs10899261   11     GUCY2E
              TMEM140   rs1483179    8      NKAIN3
              TMEM140   rs336384     4      INPP4B
              TMEM140   rs6836941    4      KIAA0232
              TMEM140   rs9561879    13     CLDN10
              ANKRA2    rs1009697    13     DAOA
              ANKRA2    rs17005004   4      C4orf22

Table 7.3.: trans-QTLs for the genes associated with the metabolites predictive of fasting glucose levels.

Little is known about the role of TMEM140 (transmembrane protein 140); it is hypothesised to be involved in hematopoiesis (Shimizu et al., 2008), and was recently found to be moderately differentially expressed in a case/control study of preeclamptic pregnancies (Løset et al., 2011).

The trans-QTL SNPs shown are associated with the genes to varying degrees; however, we cannot infer whether they are direct regulators of these genes or whether they are instead mediated by other genes or metabolites. The trans-QTL SNPs for each gene are shown in Table 7.3. The most strongly associated trans-QTL for TFG, rs2046227, resides in proximity to GPR128 (G-protein coupled receptor 128). GPR128 and TFG have been shown to create a fusion transcript, especially in atypical myeloproliferative neoplasms but also, less commonly, in healthy individuals (Chase et al., 2010).

Population Structure

Although the DILGOM dataset was randomly sampled from unrelated individuals, hidden population structure may still induce spurious associations in the data, confounding our analysis.


Figure 7.11.: Inferred causal networks for three metabolites stably associated with fasting glucose levels: (a) lactate, (b) M VLDL TG, (c) isoleucine. The edge weights are the R^2 from a univariable linear regression of each child node on each parent node. The R^2 from a multivariable lasso linear regression on all inputs (SNPs for genes and genes for metabolites) is shown in parentheses next to each node.


Figure 7.12.: Top 5 principal components from PCA of the genotype data.


Probe           Gene      λ
ILMN 2341815    TFG       1.0078
ILMN 1656676    ZYG11B    1.0006
ILMN 1712432    PSMD2     1.0074
ILMN 1793017    DGKQ      1.0058
ILMN 1766797    CCS       1.0098
ILMN 1867138    -         1.0102
ILMN 1687351    ANKRA2    1.0030
ILMN 1736863    TMEM140   1.0051

Table 7.4.: Genomic inflation factors for the genes associated with metabolites predictive of fasting glucose, based on the median χ^2 statistics from the linear model of association in PLINK.

We used smartpca from Eigensoft 4.0beta (Price et al., 2006) to assess stratification in the genotype data (Figure 7.12). There was no evidence for substantial substructure in the genetic data. We also estimated genomic inflation factors λ for the genes implicated in the networks for lactate, M VLDL TG, and isoleucine. We used the median χ^2 statistic from the linear regression test in PLINK (Purcell et al., 2007), using all SNPs in the data, to check for possible hidden population structure in the genotype data (Table 7.4). There was no evidence for substantial inflation of the test statistics for the genes analysed.

7.4. Discussion

In this chapter we have analysed the DILGOM dataset, composed of gene expression, SNP, and metabolite data, assayed from a random sample of a Finnish population. Our aims were to estimate how much of the variation in metabolite levels could be attributed to gene expression after accounting for known clinical factors, to assess how many of the predictive genes could themselves be predicted from SNPs, and to infer causal pathways regulating these metabolites.

This work expands on previous analyses of the same data (Inouye et al., 2010a,b), which examined the lipid-leukocyte (LL) gene module and inferred causal networks of interaction with other genes and metabolites. Using lasso-penalised linear models, we have uncovered genome-wide expression associated with metabolite levels, and for these predictive genes we then discovered highly associated SNPs. Using cis-QTLs as causal anchors for a partial correlation analysis, we oriented the edges of the SNP-gene-metabolite networks, inferring causal pathways of genetic regulation of gene expression associated with serum metabolite levels. Three metabolites — lactate, isoleucine, and triglycerides in medium VLDL (M VLDL TG) — were stably associated with fasting glucose levels, a commonly-used clinical marker for type 2 diabetes.


These results suggest potential causal mechanisms that contribute to the genetic basis of type 2 diabetes, mediated by gene expression and metabolites. Specifically, we inferred causal networks for isoleucine, known to be associated with future type 2 diabetes status (Wang et al., 2011). We reproduced the association of isoleucine with fasting glucose in our data. The causal network for isoleucine included two of the genes associated with it (ANKRA2 and TMEM140) that were themselves shown to be under strong genetic control by cis-QTLs. This causal network represents a novel hypothesis of the gene pathways mediating the observed effect of SNPs on fasting glucose levels and on risk of type 2 diabetes, and further elucidates the genetic basis of type 2 diabetes.

In addition, this work demonstrates the increased insight gained by employing multiple data types, each examining a different aspect of the samples. While associations between genes and metabolites were informative, only through the use of SNP data, and of cis-QTLs specifically, were we able to orient the causal edges of the graph and to generate plausible hypotheses of genetic regulation mediated by gene expression. Besides the cis-QTLs, the trans-QTLs provide additional information about the regulatory pathways, such as the association between genetic variation near GPR128 and the gene expression of TFG, which we have predicted to be a causal factor of the levels of triglycerides in medium VLDL.

The integrative analysis presented in this chapter was possible due to scalable methods for fitting lasso-penalised linear models to genome-wide data. This work has demonstrated both the practical utility and the feasibility of such models for the detection of gene-metabolite and SNP-gene associations. In addition, the use of multi-omic datasets (genes, SNPs, and metabolites) makes it possible to generate more detailed hypotheses of biological mechanisms of cellular regulation, potentially leading to a better understanding of the causal structure of disease development. Integration of yet more data types, such as epigenetic markers or structural data such as copy number variation, will likely lead to an even better understanding of disease etiology.

Limitations

This work has several limitations. First, we have used gene expression from lymphocytes in whole blood, which is likely different from the expression of the same genes in the tissues responsible for metabolism, for example the liver; however, such samples would be difficult to obtain due to the technical and ethical challenges posed by taking biopsies from healthy individuals. This means that important associations may not have been found.
Second, we only employed linear models and did not model epistatic interactions between genes and genes, genes and metabolites, SNPs and SNPs, or combinations of these factors. Interactions cannot be ruled out, although Surakka et al. (2011) found only one such interaction, between waist-hip ratio and rs6448771, with total cholesterol (TC) as the phenotype, which was statistically significant but explained less than 0.5% of the phenotypic variance of TC.


Third, we did not attempt to infer a causal graph of all genes related to a given metabolite, but rather only considered each gene separately. Fourth, we focused on the genes that were predictable from SNPs; we did not examine genes that were highly predictive of metabolite levels but could not be predicted from SNPs. Nor did we examine all SNPs potentially associated with metabolites, but rather filtered the SNPs based on association with the predictive genes. In other words, there may yet be SNPs predictive of metabolite levels that are not associated with the gene expression assayed in these data. Fifth, further external validation of our findings in independent multi-omic datasets is required in order to confirm their robustness and to allay concerns of confounding due to unknown factors. Sixth, larger sample sizes are likely to be needed to achieve enough statistical power to detect weaker associations that could be present in this dataset but were undetected in this analysis. Seventh, moving beyond linear models to more flexible models such as trees (Random Forests or boosting) would potentially allow detection of non-linear effects, such as gene-gene interactions, that could further enhance prediction ability and may provide a more realistic description of the underlying complex biology. Finally, other approaches for causal structure inference (Aten et al., 2008) could potentially allow us to infer more complicated causal structures than we have here, including those involving multiple genes interacting with each other (a global directed network). Inference of the entire network of SNPs, genes, and metabolites would potentially provide a better elucidation of the causal mechanisms underlying genetic regulation of metabolism and associated phenotypes such as type 2 diabetes.


8. Fused Multitask Penalised Regression

In Chapter 7 we analysed a multi-omic dataset consisting of SNPs, gene expression, and metabolites. We used multiple inputs for each model (multiple SNPs per gene, multiple genes per metabolite), but we only used one output: each phenotype was considered independently. However, many phenotypes, for example metabolite levels and gene expression levels, are clearly not independent: there exist strong correlations within the phenotypes. These correlations could potentially be used to build better predictive models. In this chapter we propose fmpr, a statistical method for taking advantage of the inter-dependencies between the multiple phenotypes, or tasks, for use in gene expression or genetic datasets. In simulation, we show that our approach induces better predictive models than other approaches that do not consider the task relatedness, such as the lasso, and in some cases better than the graph-guided fused lasso, which accounts for task relatedness. In an analysis of the DILGOM dataset, fmpr achieves better predictive ability than the lasso.

8.1. Introduction

High-throughput technologies, such as gene expression microarrays and single nucleotide polymorphism (SNP) microarrays, have made it possible to assay thousands and even millions of potential biological markers, in order to detect associations with clinical phenotypes such as disease. At the same time, the definition of what is a phenotype has become more encompassing. Rather than considering only macro-level clinical phenotypes such as the presence of disease, other lower-level (intermediate) phenotypes, such as gene expression and metabolite levels, are becoming of interest.


levels, are becoming of interest. For example, it is now routine to scan SNPs for potential expression quantitative trait loci (eQTL), regulating the expression of genes (Mackay et al., 2009), or the levels of serum metabolites (Inouye et al., 2010a,b; Tukiainen et al., 2011). With these analyses there can be hundreds to thousands of phenotypes rather than the single binary status phenotype commonly found in case/control studies.

Figure 8.1.: An illustration of a hypothetical setup in which five genes G_1, ..., G_5 are associated with five metabolites M_1, ..., M_5. Several metabolites share the same gene associations (solid lines), and are therefore correlated with each other (correlation shown by dashed lines). By leveraging the inter-metabolite correlations, multitask methods such as fmpr and GFlasso aim to better identify which inputs (genes in this case) are truly associated with which outputs (metabolites), while avoiding spurious associations due to effects such as noise, under the assumption that correlated outputs are caused by common regulators (pleiotropic genes).

The simplest approach to analysing these multiple phenotypes is to consider each one separately, essentially ignoring their inter-dependencies. However, many of these phenotypes are related to each other, and information may be gained from these relationships. For example, the expression of some genes can be highly correlated if they are regulated by the same transcription factor (TF), and a genetic variant at that TF will potentially influence all of them (so-called pleiotropy). Therefore, by considering all phenotypes concurrently, and explicitly accounting for the correlations between them, there is potential for better statistical models, and for better understanding of the underlying biological processes that may be common to these phenotypes.
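To make the single-task baseline concrete, the following sketch fits one lasso model per phenotype with glmnet (the lasso implementation used later in this chapter); it is only an illustration of the "K separate regressions" approach, and the names X (an N × p input matrix) and Y (an N × K phenotype matrix) are assumptions made for the example.

    ## A minimal sketch of the single-task baseline: each of the K phenotypes is
    ## modelled by its own lasso fit via glmnet, ignoring the correlations
    ## between phenotypes. X (N x p) and Y (N x K) are assumed names.
    library(glmnet)

    single_task_lasso <- function(X, Y) {
      lapply(seq_len(ncol(Y)), function(k) {
        cv <- cv.glmnet(X, Y[, k], alpha = 1)   # per-task cross-validated lasso
        coef(cv, s = "lambda.min")              # weights at the selected penalty
      })
    }

Nothing in such a fit links the weights for one task to those of any other task, which is precisely the limitation that the multi-task approaches discussed next aim to address.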


Recently, Kim and Xing (2009) proposed the graph-guided fused lasso (GFlasso), a statistically principled approach for analysing datasets with multiple related phenotypes, under the assumption that correlated phenotypes are driven by similar underlying causal mechanisms, such as common SNPs affecting the expression of several genes or common pleiotropic genes affecting metabolite levels (Figure 8.1). GFlasso performs variable selection through the lasso penalty on the model fit for a given phenotype, influenced by the weights of the other phenotypes correlated with this phenotype. The differences between the weights for the same marker in the different phenotypes (tasks) are themselves penalised using a lasso-type fusion penalty, which tends to induce models where the same input (SNP in this case) is selected to be in the model across all phenotypes, while allowing the weights to vary between phenotypes.

We propose an alternative approach, termed Fused Multitask Penalised Regression (fmpr), that essentially replaces the fused lasso term with a fused ridge term, and present an efficient algorithm based on coordinate descent to optimise the fmpr loss function. In this chapter, we investigate the performance of fmpr compared with GFlasso and several other methods, on simulated data and on the DILGOM data.

8.2. Background

The lasso is a useful method for variable selection and shrinkage in the single-task setting, defined as the solution to the l1-penalised least squares problem

    \beta^* = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,    (8.1)

where y_i ∈ R and x_i ∈ R^p are the ith output and input, respectively, N is the number of samples, and β_j ∈ R is the jth model weight (regression coefficient). In this formulation and all following loss functions, we assume that the inputs x and outputs y are standardised to zero mean and unit variance, so we do not include an intercept term in the model. An alternative approach to the lasso is ridge regression (l2 penalty),

    \beta^* = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.    (8.2)

It is well known that the lasso penalisation induces sparse models, in which many weights are exactly zero, whereas the ridge penalisation does not, in general (Hastie et al., 2009a; Tibshirani, 1996). The lasso performs both variable selection, setting the irrelevant variables to zero, and shrinkage (reduction of the weights towards zero), whereas ridge regression only tends to shrink the weights. A penalisation scheme that combines the two penalties


is the naïve elastic net (Zou and Hastie, 2005),

    \beta^* = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 + \alpha\lambda \sum_{j=1}^{p} |\beta_j| + (1 - \alpha)\lambda \sum_{j=1}^{p} \beta_j^2.    (8.3)

Through the tuning parameter α ∈ [0, 1], a compromise between ridge regression and the lasso can be achieved, and the number of variables that can enter the model can be larger than the number of observations (unlike the lasso).

The lasso, ridge regression, and elastic net are single-task methods, in that they consider only one output vector y at a time. Multiple tasks can be modelled by fitting a separate model to each task; however, none of these methods takes into account any information from the other tasks when fitting a model for a given task.

The graph-guided fused lasso (GFlasso) (Kim and Xing, 2009) is an extension of the lasso, applied to multiple tasks concurrently, which employs a fusion penalty to selectively merge together the weights of K related outputs. The motivation is to borrow power across multiple outputs, such that similar outputs (phenotypes) will tend to have similar inputs (such as SNPs) as non-zero, thus tending to select the same inputs across all outputs. The GFlasso for linear loss is formulated as

    B^* = \arg\min_{B \in \mathbb{R}^{p \times K}} \frac{1}{2} \sum_{k=1}^{K} \lVert y_k - X\beta_k \rVert_2^2 + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\beta_{jk}| + \gamma \sum_{(m,l) \in E} f(r_{ml}) \sum_{j=1}^{p} |\beta_{jm} - \mathrm{sign}(r_{ml}) \beta_{jl}|,    (8.4)

where ||z||²₂ = Σ_{i=1}^N z_i² is the squared l2-norm, y_k = (y_1k, ..., y_Nk)^T is the output vector for the kth task, X is the N × p matrix of inputs with rows x_i^T, B = [β_1, ..., β_K] ∈ R^{p×K} is the matrix of weights for all tasks, f(r_ml) is a function monotonic in the absolute value of the Pearson correlation r_ml between the mth and lth phenotypes, and E is the set of inter-task edges. The edges can be induced by thresholding the Pearson correlation r_ml, or they can simply be the set of all edges. The set of edges E is assumed to be identical for all p variables, meaning that the degree of task relatedness does not vary across the variables.

8.3. Methods

Our aim is to develop a method for modelling multiple related outputs in a computationally efficient and thus scalable way.

8.3.1. Fused Multitask Penalised Regression

Unlike GFlasso, which uses an l1 fusion penalty, fmpr uses an l2 fusion penalty to shrink the differences between the weights of related tasks. As discussed below, the change in penalisation has implications


both in terms of recovery of the true non-zero inputs, and in terms of optimising the loss function. We formulate the fused ridge penalised squared loss as

    B^* = \arg\min_{B \in \mathbb{R}^{p \times K}} \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - x_i^T \beta_k)^2 + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\beta_{jk}| + \frac{\gamma}{2} \sum_{m=1}^{K} \sum_{l=1}^{K} f(r_{ml}) \sum_{j=1}^{p} [\beta_{jm} - \mathrm{sign}(r_{ml}) \beta_{jl}]^2.    (8.5)

This problem, like the GFlasso, is a convex optimisation problem (see, for example, Boyd and Vandenberghe, 2004). The λ penalty tunes sparsity within each task (lasso). The γ penalty shrinks the differences between weights for related tasks towards zero but, unlike the GFlasso, does not necessarily encourage sparsity in the differences between the weights for related tasks. The effect of the fusion penalty can be seen in Figure 8.2, where the weights for a single parameter β_j over K = 10 tasks are shown for increasing values of γ. The fusion penalty smoothly encourages the weights across the tasks to become more similar to each other, until they are identical. The fmpr loss is equivalent to the lasso when γ = 0.

Figure 8.2.: The solution path of fmpr for one parameter β_j over K = 10 tasks, for increasing γ and with λ = 0.
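As an illustration of Eqn. 8.5, the sketch below (not the fmpr implementation itself; it assumes standardised X and Y and uses f(r) = r²) computes the fused ridge penalised loss for a given weight matrix B.

    ## A minimal sketch of the fused ridge penalised loss in Eqn 8.5.
    ## X is N x p, Y is N x K, B is p x K; lambda and gamma are the penalties.
    fmpr_loss <- function(X, Y, B, lambda, gamma, f = function(r) r^2) {
      R <- cor(Y)                              # inter-task Pearson correlations r_ml
      K <- ncol(Y)
      sq.loss <- 0.5 * sum((Y - X %*% B)^2)    # squared-error term over all tasks
      l1.pen  <- lambda * sum(abs(B))          # within-task lasso penalty
      fusion  <- 0
      for (m in 1:K) {
        for (l in 1:K) {
          d <- B[, m] - sign(R[m, l]) * B[, l] # signed differences between task weights
          fusion <- fusion + f(R[m, l]) * sum(d^2)
        }
      }
      sq.loss + l1.pen + 0.5 * gamma * fusion
    }

Setting gamma to zero recovers the ordinary lasso objective for each task, as noted above.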


While the fused ridge approach has been presented here for linear loss, the same method can be applied to other loss functions such as the logistic and squared hinge losses for classification.

A crucial factor in multi-task methods such as fmpr and GFlasso is the definition of task relatedness, which is the basis for the fusion penalty. The simplest approach is to threshold the Pearson correlation in absolute value, thus only considering as edges correlations with high enough magnitude. However, thresholding has two major disadvantages that greatly limit its usefulness in practice. First, the threshold is arbitrary, and must be manually tuned for each set of inputs. Second, thresholding is inherently a binary operation, and any pair of outputs with correlation slightly below the cutoff will be considered completely unrelated. Determining the correct threshold is especially problematic when the data are noisy, as is typically the case in many genomic experiments, in which case the correlation between outputs will be subject to random fluctuations as well. An alternative approach to thresholding is weighting, where the magnitude of the correlation defines task relatedness in a continuous fashion, without the need for an arbitrary cutoff. Useful weighting functions explored by Kim and Xing (2009) include f(r) = r² and f(r) = |r|, and we will use these as well. The weighting functions are monotonic in |r|, so that negative correlation has the same magnitude of effect as positive correlation, except that the weights are encouraged to be dissimilar rather than similar.

8.3.2. Implementation

We use cyclical coordinate descent (Friedman et al., 2010) to minimise the fused ridge loss, as outlined in Algorithm 2. Coordinate descent iterates over all variables, one at a time, performing a univariable Newton step with respect to the variable being updated. It is especially efficient for fitting sparse models, since variables that are not in the model do not need to be visited in each iteration. Like the lasso and ridge regression, the graph-guided fused lasso and the fused ridge are convex problems, and can be solved using standard tools from convex optimisation. However, unlike the lasso or the fused ridge, which can be solved efficiently with coordinate descent, the lasso fusion loss cannot be minimised by coordinate descent in its original form, as the minimisation process may result in suboptimal solutions, getting stuck in non-smooth corners of the loss function (Friedman et al., 2007), due to the non-separability of the penalty (Tseng, 2001). Consequently, other methods have been proposed, such as the smoothing proximal gradient method (Chen et al., 2012) or reweighted-l2 methods (Bach et al., 2011; Kim and Xing, 2009).

In the coordinate descent procedure, we iterate over each variable j = 1, ..., p in each task k = 1, ..., K, taking a Newton step

    s_{jk} = \beta_{jk} - \frac{\partial L / \partial \beta_{jk}}{\partial^2 L / \partial \beta_{jk}^2},    (8.6)


where L is the loss function. For linear loss (least squares regression), the first partial derivatives are

    \frac{\partial L}{\partial \beta_{jk}} = \sum_{i=1}^{N} x_{ij} (x_i^T \beta_k - y_{ik})    (8.7)

and the second partial derivatives are

    \frac{\partial^2 L}{\partial \beta_{jk}^2} = \sum_{i=1}^{N} x_{ij}^2.    (8.8)

The partial derivatives of the fusion penalty Ω are

    \frac{\partial \Omega}{\partial \beta_{jk}} = \sum_{l=1}^{K} f(r_{kl}) (\beta_{jk} - \mathrm{sign}(r_{kl}) \beta_{jl})    (8.9)

and

    \frac{\partial^2 \Omega}{\partial \beta_{jk}^2} = \sum_{l=1}^{K} f(r_{kl}).    (8.10)

In a fashion similar to the implementation of the elastic net method (Zou and Hastie, 2005) (Eqn. 8.3), fmpr implements the combined penalties by first computing the Newton step, then applying the fusion penalty to arrive at a penalised Newton step, and finally soft-thresholding the penalised Newton step to achieve sparsity through the lasso penalty.

8.3.3. Computational Enhancements

We employ active-set convergence (Friedman et al., 2010) to speed up convergence. Briefly, we begin by iterating over all variables. Any variable that becomes zero is declared inactive and removed from the active set. Once all variables have converged, as determined by absolute convergence of the loss, we iterate over all variables again. If the active set remains the same, then the algorithm terminates. Otherwise, all variables are added back to the active set and we iterate over them as before.

In addition, we organise the computation such that for each γ, fmpr computes the solutions over a grid of increasing λ penalties. The smallest λ is user defined, but the largest λ is defined as the penalty making all model weights zero. Thus, if at any stage along the path all weights become zero, we can terminate the path early, since increasing λ further will necessarily result in all weights staying zero, and the weights for these models do not need to be computed explicitly.


    while not converged do
        for k = 1, ..., K do
            for j = 1, ..., p do
                // derivatives
                d_1 ← ∂L/∂β_jk
                d_2 ← ∂²L/∂β²_jk
                // inter-task fusion penalty
                d_1 ← d_1 + γ Σ_{l=1}^K f(r_kl) (β_jk − sign(r_kl) β_jl)
                d_2 ← d_2 + γ Σ_{l=1}^K f(r_kl)
                // Newton step
                s_jk ← β_jk − d_1 / d_2
                // lasso soft-thresholding
                β_jk ← 0 if |s_jk| ≤ λ, otherwise s_jk − λ sign(s_jk)
            end
        end
    end

Algorithm 2: An outline of the coordinate descent algorithm for minimising the fused ridge penalised loss function. We assume that each input x_j and output y_k is standardised to zero mean and unit variance; f(r_kl) is a monotonic function in the absolute value of the correlation r, such as |r| or r².
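For concreteness, the sketch below implements one sweep of Algorithm 2 in R; it is a simplified, unoptimised illustration (not the fmpr package itself), assuming standardised X and Y, a precomputed task correlation matrix R, and f(r) = r².

    ## A minimal sketch of one coordinate descent sweep (Algorithm 2, Eqns 8.6-8.10).
    fmpr_sweep <- function(X, Y, B, lambda, gamma, R, f = function(r) r^2) {
      p <- ncol(X); K <- ncol(Y)
      for (k in 1:K) {
        for (j in 1:p) {
          res <- X %*% B[, k] - Y[, k]                  # current residual for task k
          d1 <- sum(X[, j] * res)                       # Eqn 8.7
          d2 <- sum(X[, j]^2)                           # Eqn 8.8
          w  <- f(R[k, ])                               # fusion weights f(r_kl)
          d1 <- d1 + gamma * sum(w * (B[j, k] - sign(R[k, ]) * B[j, ]))  # Eqn 8.9
          d2 <- d2 + gamma * sum(w)                     # Eqn 8.10
          s  <- B[j, k] - d1 / d2                       # Newton step, Eqn 8.6
          B[j, k] <- if (abs(s) <= lambda) 0 else s - lambda * sign(s)   # soft threshold
        }
      }
      B
    }

In practice the sweep is repeated until the loss converges, with the active-set and early-termination strategies of Section 8.3.3 layered on top.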


8.4. Simulation

We compared the following penalised regression approaches:

• The fused ridge method fmpr, weighted by the correlation of the outputs (fmpr-w1 for |r| and fmpr-w2 for r²);
• GFlasso; we used a C implementation of the MATLAB tool spg (Chen et al., 2012)¹, weighted by the output correlation (GFlasso-w1 for |r| and GFlasso-w2 for r²);
• Ridge regression;
• Lasso, using glmnet (Friedman et al., 2010);
• Naïve elastic net, using glmnet.

¹http://www.cs.cmu.edu/~xichen/Code/SPG_Multi_Graph.zip

The first two methods take into account the task relatedness. The remaining three do not, in that the weights of the K tasks are completely unrelated to each other; this is equivalent to K separate regressions. We did not include thresholded versions of fmpr and GFlasso, where the graph is cut based on a correlation threshold, since the threshold is set arbitrarily and may result in meaningless graphs, especially over cross-validation replications where the correlation between phenotypes varies randomly. Kim and Xing (2009) found the weighted versions of GFlasso to consistently outperform the thresholded version in simulation and on real data.

We evaluated the models in terms of recovery of the causal input variables, and in terms of predictive ability:

• Recovery is binary classification of whether the true weights are non-zero or zero, and is measured using precision/recall (PRC) and ROC curves. We used 5-fold cross-validation to find the optimal hyperparameters, which were then used to fit the model to the entire dataset. The absolute values of the estimated model weights |β| are compared against the zero pattern of the true weights, with non-zero weights taken as the positive class. We used the R package ROCR (Sing et al., 2005) to estimate these curves.
• Predictive ability is measured as R² in cross-validation. For all methods, we used nested cross-validation, where the data are split 90%/10%. The 90% portion is used for 5-fold cross-validation to find the optimal hyperparameters. We then fit the model using the optimal hyperparameters to the entire 90% subset, and test it on the previously unseen 10%. Nested cross-validation eliminates the optimisation bias inherent in using standard cross-validation to both optimise hyperparameters and evaluate predictive performance (Ambroise and McLachlan, 2002).

We considered the following classes of experimental setups with varying parameters:

1. The same sparsity pattern and same weights across all tasks


a) Differing sample sizes N = 50, 100, 200, with p = 100, K = 10, σ = 1, and β = 0.1.
b) Differing noise levels σ = 0.5, 1, 2, with N = 100, p = 100, K = 10, and β = 0.1.
c) Differing numbers of tasks K = 5, 10, 20, with N = 100, p = 100, β = 0.1, and σ = 1.
d) Differing weights of the causal variables β = 0.05, 0.1, 0.5, with N = 100, p = 100, K = 10, and σ = 1.
e) Differing numbers of variables p = 50, 100, 500, 1000, with N = 100, K = 10, σ = 1, and β = 0.1.

2. Same sparsity pattern with different weights β_j ∼ N(μ_β, σ_β)

a) β_j ∼ N(0.5, 0.05), N = 100, p = 100, K = 10, and σ = 1.
b) β_j ∼ N(0.5, 0.5), N = 100, p = 100, K = 10, and σ = 1.
c) β_j ∼ N(0.5, 2), N = 100, p = 100, K = 10, and σ = 1.

3. Different sparsity pattern with different weights (completely unrelated tasks)

a) β_jk ∼ N(0, 1), N = 100, p = 100, K = 10, and σ = 1.

4. Same sparsity and same-magnitude weights, but weight signs flipped in some tasks (mixed positive/negative correlation)

a) β ∈ {−0.1, 0.1}, N = 100, p = 100, K = 10, and σ = 1.

The reference setup for all experiments with the same sparsity and same weights was N = 100, p = 100, K = 10, β = 0.1, and σ = 1.

An illustration of the weight matrix B for the three sparsity setups is shown in Figure 8.3. The same-sparsity same-weights setup appears as bands of uniform colour across all K tasks. The same-sparsity different-weights setup shows bands with differing weights across the tasks. The unrelated setup is random: each task has its own independent set of weights. The induced inter-task correlations are shown for each setup pattern. For same-sparsity same-weights, all correlations are high. For the same-sparsity different-weights setup most correlations are high, with some exceptions. For the unrelated setup, correlations are substantially lower. Note that these correlations are strongly dependent on the level of noise: the higher the noise level, the lower the inter-task correlations will become, regardless of the sparsity pattern.

All the input variables x were generated as iid Gaussian random variables. The outputs y were simulated as

    y_{ik} = x_i^T \beta_k + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, ..., N, \; k = 1, ..., K.    (8.11)

We computed ROC and PRC curves using the absolute weights |β̂| of each model compared with the binary sparsity pattern of the true weights I(β ≠ 0). The results are threshold-averaged (Fawcett, 2006) over 50 independent replications to produce average curves.
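As an illustration of this simulation design (Eqn. 8.11), the sketch below generates one replicate of the same-sparsity, same-weights setting; the function name and the drawing of independent noise for every task are assumptions made for the example.

    ## A minimal sketch of the simulation in Eqn 8.11 (same sparsity, same weights).
    simulate_tasks <- function(N = 100, p = 100, K = 10, beta = 0.1,
                               sigma = 1, prop.zero = 0.8) {
      X <- matrix(rnorm(N * p), N, p)               # iid Gaussian inputs
      nonzero <- runif(p) > prop.zero               # ~20% of variables causal, on average
      B <- matrix(0, p, K)
      B[nonzero, ] <- beta                          # same weights across all K tasks
      E <- matrix(rnorm(N * K, sd = sigma), N, K)   # Gaussian noise
      Y <- X %*% B + E
      list(X = X, Y = Y, B = B)
    }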


Figure 8.3.: An illustration of the three sparsity setups used in the multi-task simulations. Top row: absolute values of the p × K weight matrix B used for generating the outputs y, for models 1, 2, and 3, respectively (model 4 has identical weights and correlations in absolute value to model 1). Bottom row: the K × K correlation matrices of the outputs y.


8. Fused Multitask Penalised Regressionaveraged (Fawcett, 2006) over 50 independent replications to produce average curves. Forexperiments based on using the same sparsity patterns, we selected 80% <strong>of</strong> the weights to bezero, on average. There<strong>for</strong>e, the a null model (no predictive ability) has an expected areaunder ROC curve <strong>of</strong> 0.5, and an expected precision <strong>of</strong> 0.2 (the proportion <strong>of</strong> the positiveclass) <strong>for</strong> all recall levels. We computed an unbiased estimate <strong>of</strong> R 2 <strong>for</strong> the models withhyperparameters tuned in cross-validation and tested on independent data (not used in thecross-validation).For each penalised method, we explored a grid <strong>of</strong> penalties across each penalty. We useda grid <strong>of</strong> 20 penalties λ max × (1, ..., 10 −6 ) and γ = 10 −3,...,6 , where λ max is the smallestlambda that makes all weights in the model zero (see Section 5.4.2) We did not threshold thecorrelation matrix, instead we used the correlation as is, unlike Kim and Xing (2009), sincethe thresholding is <strong>of</strong>ten arbitrary and data dependent. There<strong>for</strong>e, all correlations were usedas the basis <strong>for</strong> graph edges (the set E included all possible edges, with different weights basedon the correlation).8.5. ResultsWe compared the per<strong>for</strong>mance <strong>of</strong> fmpr with other methods in simulation and the DILGOMdata.8.5.1. SimulationSetup 1: Increasing sample size Figure 8.4 shows that as sample size N increased from 50to 200, so did the recovery per<strong>for</strong>mance <strong>of</strong> all methods, with fmpr-w1 and fmpr-w2 showingnotably higher ROC and PRC curves than the other methods. fmpr-w2 showed a smalladvantage over fmpr-w1 in terms <strong>of</strong> recovery. Both fmpr methods increased the recoverywith increasing sample size, much faster than lasso, elastic net, and ridge regression thatimproved only slightly. R 2 increased as well, from approximately 0 <strong>for</strong> all methods to morethan 0.1 <strong>for</strong> fmpr-w1 and fmpr-w2.Setup 2: Increasing noise levels Figure 8.5 shows that as noise levels σ increased, so didthe recovery per<strong>for</strong>mance <strong>of</strong> all methods decrease, until it was not much better than random<strong>for</strong> methods with σ = 2. At the lowest and middle noise levels, fmpr-w1 and fmpr-w2 showedbetter recovery than all other methods, with higher R 2 as well.Setup 3: Increasing number <strong>of</strong> tasks Figure 8.6 shows that with increasing number <strong>of</strong> tasksK, there was no substantial change in the recovery per<strong>for</strong>mance <strong>of</strong> lasso, elastic net, and ridgeregression, which is consistent with our expectation, as these methods ignore the inter-taskdependencies and are there<strong>for</strong>e equivalent to K separate models. The small differences are182


likely due to random variation in the simulation data, and the fact that with more tasks the ROC/PRC curves are estimated over more data and are hence more precise. In contrast, both GFlasso-w1/w2 and fmpr-w1/w2 showed increases in recovery and in R², with fmpr-w2 having the best performance overall.

Setup 4: Increasing weights. Figure 8.7 shows that with increasing weights β, all methods showed substantial improvements in both true weight recovery and R², from close to random performance with β = 0.05 to high, and in some cases close to perfect, performance for β = 0.5. In the reference setting, fmpr had better recovery than all other methods; however, in the high-weight setting both fmpr and GFlasso performed equally, with ridge regression showing the lowest performance.

Setup 5: Increasing number of parameters. Figure 8.8 shows that, unlike with an increasing number of tasks or samples, when the number of parameters p increased there was no monotonic increase in performance; rather, all methods improved when p increased from 50 to 100, but their performance was reduced as p went to 200. This phenomenon may be due to the fact that the simulations use a fixed proportion of zero weights (80% on average), together with a fixed weight of 0.1 and a fixed noise level of σ = 1. When p = 50, 100, 200, the output in each task y_k is then a weighted sum of 5, 20, and 40 weights, respectively, plus random noise. Therefore, the signal-to-noise ratio was not identical between the experiments, but increasing in p, leading to an increase in performance with increasing p, up to the point where there are too many variables in the model to estimate from the given amount of data. From that point onwards, an increase in p with N fixed (together with a concomitant increase in the number of true non-zero variables) tended to reduce the performance in recovery of the weights.

Setups 6 and 7: Differing causal architectures. The previous simulation setups explored the scenario where all tasks have the same sparsity pattern and the same weights β, showing a strong advantage of fmpr over the other approaches. In contrast, as seen in Figure 8.9, when tasks had the same sparsity pattern but the weights differed slightly (μ_β = 0.5, σ_β = 0.05), both fmpr and GFlasso performed identically well, exhibiting both better recovery and higher R² than elastic net, lasso, and ridge. When the weights varied more substantially (μ_β = 0.5, σ_β = 0.5 and σ_β = 2), fmpr did not perform better than elastic net or lasso (but better than ridge), and GFlasso-w1/w2 performed slightly better in terms of recovery and R². When the tasks were completely unrelated, through different sparsity patterns and different weights, all methods except ridge regression performed identically (Figure 8.10).

Setup 8: Mixed positive/negative correlations. Since the sign of the correlation is accounted for in the fmpr penalty, we expect to see similar results for positive and negative correlations, as long as their magnitude is the same.
To verify this, we performed experiments identical to


the reference setup, except that the signs of the weights were randomly flipped in some tasks, creating a mix of positive and negative correlations (50% negative on average out of all task pairs). The results were qualitatively similar to the reference setup (Figure 8.11), confirming that fmpr takes advantage of both negative and positive correlations between the tasks.

To visualise the effect of each method, Figure 8.12 shows the sparsity patterns recovered by each method for the reference setup, in one simulation, compared with the true simulation weights β. fmpr-w1 and fmpr-w2 produced distinct banding patterns, similar to that of the true weights. GFlasso-w2 had similar banding patterns, and to a lesser extent so did GFlasso-w1. In contrast, lasso, ridge, and elastic net did not produce any noticeable banding patterns at all, with lasso cross-validation selecting a close-to-empty model (most weights were zero). These qualitative results complement the quantitative results for weight recovery shown in Figure 8.4b, where fmpr-w1/w2 showed substantially better ROC and PRC curves than the other methods.

8.5.2. Experiments on the DILGOM Dataset

As a proof of concept of the applicability of the multi-task approach to real data, we used the Dietary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome (DILGOM) dataset (Inouye et al., 2010b), containing 509 samples in total (234 males, 275 females) assayed for 136 metabolites and for gene expression of 35,419 genes, with the aim of predicting metabolite levels from the gene expression levels. The preprocessing of the data has been described in Chapter 7. The DILGOM data represent a more realistic setting than the simulations, as there are strong correlations both in the inputs and in the outputs, and any pleiotropic gene is likely to have different association strengths with different metabolites, some of them potentially with opposing signs (upregulating one metabolite, downregulating another).

To reduce the computational load, we first used univariable linear regression of each metabolite on each gene, and then selected genes that had at least one metabolite with linear regression p ≤ 5 × 10⁻⁴ (from a t-test with N − 1 degrees of freedom). This resulted in p = 2429 genes. We also included the 21 clinical variables in order to reduce confounding of the gene expression levels due to clinical factors.

We selected cluster 1 of the metabolites (defined in Section 7.3.1), a cluster consisting of 35 metabolites, mainly VLDL subtypes. This cluster exhibited strong correlations (Figure 8.13): all correlations were positive and many were strong (0.8 and higher). We then used repeated 5-fold nested cross-validation to estimate the R²: we used 4/5ths of the data for optimising the penalties (through cross-validation), trained the models on the training data using the optimal penalties found, and tested the models on the independent 1/5th of the data. This process was repeated ten times.
Due to the high computational cost, we did not evaluate GFlasso (see Section 8.5.3 for timing experiments), elastic net, ridge regression, or the w1 variant of fmpr.
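The gene pre-filter described above can be sketched as follows; expr (the N × 35,419 expression matrix) and metab (the N × 136 metabolite matrix) are hypothetical object names, and a vectorised implementation would be used in practice for this many regressions.

    ## A minimal sketch of the univariable pre-filter: keep genes with at least
    ## one metabolite association at p <= 5e-4.
    filter_genes <- function(expr, metab, threshold = 5e-4) {
      keep <- apply(expr, 2, function(g) {
        pvals <- apply(metab, 2, function(m) {
          summary(lm(m ~ g))$coefficients["g", "Pr(>|t|)"]  # slope t-test p-value
        })
        any(pvals <= threshold)
      })
      expr[, keep, drop = FALSE]
    }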


Figure 8.4.: Simulations with varying numbers of samples N (Setup 1), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) N = 50; (b) N = 100 (reference); (c) N = 200.


Figure 8.5.: Simulations with varying levels of noise σ (Setup 2), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) σ = 0.5; (b) σ = 1.0 (reference); (c) σ = 2.0.


Figure 8.7.: Simulations with varying weights β (Setup 4), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) β = 0.05; (b) β = 0.1 (reference); (c) β = 0.5.


Figure 8.8.: Simulations with varying number of parameters p (Setup 5), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) p = 50; (b) p = 100 (reference); (c) p = 200.


Figure 8.9.: Simulations with same sparsity but different weights β (Setup 6), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction. Panels: (a) μ = 0.5, σ_β = 0.05; (b) μ = 0.5, σ_β = 0.5; (c) μ = 0.5, σ_β = 2.


Figure 8.10.: Simulations with unrelated tasks (Setup 7), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.


Figure 8.11.: Simulations with a mixture of positively and negatively correlated tasks (roughly 50%/50% each, Setup 8), showing recovery of true causal variables (ROC/PRC) in the training data and R² in test set prediction.


Figure 8.12.: The true non-zero simulation weights β, and the weights estimated by each method in one replication of the reference setup. The intensity of the lines represents the absolute value of the estimated weight β̂. The vertical and horizontal axes correspond to variables j = 1, ..., p and tasks k = 1, ..., K, respectively. Note that the weights of the lasso model were all zero.


Figure 8.13.: Pearson correlations for 35 metabolites from cluster 1 of the DILGOM metabolites.

Recovery of the non-zero weights cannot be evaluated in these data since we do not know which weights are truly non-zero. However, we can qualitatively evaluate the sparsity pattern induced by each model. Examples of the recovered sparsity patterns for the DILGOM dataset are shown in Figure 8.14 (for clarity, we show a subset consisting of 200 genes). fmpr-w2 tended to produce sparsity patterns that were more consistent across the tasks. In comparison, lasso produced patterns that varied more between tasks.

Figure 8.15 shows the nested cross-validated R² for the different methods applied to the 35 DILGOM metabolites using the genes as inputs, for each metabolite separately and over all metabolites. fmpr-w2 showed consistently higher R² than the lasso, evaluated both for each individual metabolite and for all metabolites together. Overall, 33 metabolites had a higher median R² for fmpr-w2 than for lasso, with the exceptions being CH2inFA (methylene groups in the fatty acid chain) and HDL3_C (total cholesterol in high-density lipoprotein 3). Out of the 35 metabolites, 4 had Wilcoxon rank-sum test p-values lower than the Bonferroni-corrected threshold of 0.05/35 = 0.00143.
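The per-metabolite comparison can be sketched as below, where r2.fmpr and r2.lasso are hypothetical matrices holding the 50 cross-validated R² estimates (rows) for each of the 35 metabolites (columns).

    ## A minimal sketch of the Bonferroni-corrected Wilcoxon comparison of R^2.
    compare_r2 <- function(r2.fmpr, r2.lasso, alpha = 0.05) {
      K <- ncol(r2.fmpr)
      pvals <- sapply(seq_len(K), function(j) {
        wilcox.test(r2.fmpr[, j], r2.lasso[, j])$p.value  # rank-sum test per metabolite
      })
      pvals <= alpha / K                                  # Bonferroni threshold (0.05/35)
    }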


Figure 8.14.: Recovered weight matrices B̂ for 200 genes over the 35 metabolites, for lasso and fmpr-w2, based on penalties optimised by cross-validation. The vertical and horizontal axes represent genes and tasks, respectively. The intensity of each point represents the absolute value of the weight β.


Figure 8.15.: Box-and-whisker plots of R² for fmpr-w2 and lasso over 35 metabolites from cluster 1 of the DILGOM metabolites, using gene expression as inputs. We used 10 × 5 nested cross-validation to produce 50 estimates. The stars represent statistical significance from a Bonferroni-corrected Wilcoxon rank-sum test, p ≤ 0.05/35 = 0.00143.

8.5.3. Time Complexity

Due to the simplicity of the optimisation algorithm, the fused ridge method is fast enough to fit models to large datasets. However, it will inevitably be slower than the lasso due to the need to compute the fusion penalty and the need to optimise over two separate penalties. Unlike the lasso, fmpr's time complexity per iteration over the K tasks grows as O(K²), due to the need to consider the differences between all K × (K − 1)/2 pairs of tasks for each variable j = 1, ..., p.

Figure 8.16 shows the scaling behaviour of fmpr and spg (implementing GFlasso) with respect to different sample sizes N, numbers of parameters p, and numbers of tasks K (see Figure C.1 for boxplots showing the variability over the replications). We used the same penalties for both methods. All timings were performed on an Intel Core 2 Duo 3.06GHz machine running


Linux. Overall, fmpr substantially outperformed spg in all comparisons, especially for larger numbers of variables p and tasks K. We also evaluated the scaled time, where the timings of each method were scaled to approximately the same range in order to better visualise their scaling behaviour. The scaled results show linear increases in time for increases in N for both methods. In contrast, increasing p results in higher-than-linear increases for spg but only linear increases for fmpr, as fmpr utilises coordinate descent and no matrix operations are involved. The results for increasing K show faster-than-linear growth, consistent with the fact that the number of task-to-task edges increases quadratically with the number of tasks K.

8.6. Discussion

We have proposed Fused Multitask Penalised Regression (fmpr), a statistical model for leveraging task correlations in the multi-task setting, in order to borrow information between correlated tasks. In contrast with the fused lasso, which employs an l1 fusion penalty, fmpr employs an l2 fusion penalty, making it amenable to fast optimisation using coordinate descent. We have assessed the ability of our method and others to recover the true causal variables in simulations and to predict the phenotype (R²). The l2 fusion penalty was found to produce better models when the weights of causal variables in related tasks were of similar magnitude, a result which was robust to varying settings of noise, varying numbers of samples, varying numbers of tasks, varying magnitudes of the causal variables, and varying numbers of inputs. When the simulation weights were vastly different, but still had the same sparsity pattern, fmpr performed well, but the l1 fusion penalty of GFlasso resulted in slightly better models and better recovery of the non-zero weights. When tasks were completely unrelated, all methods performed identically, with the exception of ridge regression, which was substantially worse. In an analysis of the DILGOM dataset, using gene expression to predict concentrations of 35 correlated metabolites, fmpr-w2 showed better predictive performance than the lasso, in terms of median R², for most of the metabolites (33 out of 35), by leveraging the inter-metabolite correlations.

In summary, this work demonstrates the value of multi-task methods in the genomic setting, where multiple phenotypes, such as metabolite concentrations or gene expression levels, are typically highly correlated. Taking these correlations into account resulted in multi-task models that were consistently better than the single-task models and rarely worse, suggesting that these methods are safe to use even in realistic settings where the true causal variables are never known.
In addition, the higher computational efficiency of the fmpr approach means it can be applied more readily to larger datasets than the spg method, making fmpr potentially more useful in the analysis of real datasets.


Figure 8.16.: Average time to run fmpr and spg over 50 independent replications. (a) Increasing samples N (p = 400, K = 10). (b) Increasing parameters p (N = 100, K = 10). (c) Increasing tasks K (N = 100, p = 100). The left panel in each subplot shows the wall time; the right panel shows time scaled to the same approximate range in order to better show the trends.


9. Conclusions

A central theme of this thesis has been prediction. It is an investigation of supervised learning techniques for modelling associations between molecular data, such as gene expression values or SNPs, and variables representing clinical or molecular phenotypes. In this paradigm, models are evaluated mainly based on prediction performance in cross-validation or on independent data, rather than on finding associations that are statistically significant but do not necessarily contribute much predictive ability, as is commonly the case with many genome-wide association studies (GWAS). This thesis has shown that the predictive approach to analysis of gene expression data and large genetic data is computationally feasible, produces interpretable models of the underlying biology, and may be precise enough to enable population-wide screening for celiac disease and type 1 diabetes from SNP data. Sparse linear models of SNP data often produced models with better predictive ability than other approaches, such as non-sparse SVMs (Chapter 5) and models built on SNPs selected using univariable statistics (Chapter 6). In addition, the penalised models were more robust across different datasets with potentially different genetic architectures (Gibson, 2011). These large differences between the methods show that estimates of the amount of “missing heritability” (Manolio et al., 2009) crucially depend on the statistical model employed, and that the phenotypic and genetic variance may be better explained by sparse penalised methods. In practical terms, these results strongly suggest that penalised methods are preferable for predictive modelling of human genome-wide case/control data.

The penalised models employed in this thesis make the implicit assumption that a “small” number of features are relevant to the outcome, and the rest are spurious or unimportant.


This assumption is necessary for several reasons. First, it is biologically plausible to assume that out of the hundreds of thousands of SNPs tagging variation across the human genome, most of them are not substantially related to a given disease. Second, for statistical reasons, it is not possible to do meaningful modelling and feature selection under the assumption that many or all variables are truly associated with the phenotype, when the sample size is far outweighed by the number of variables (N ≪ p). Even lasso-like approaches, which have been shown to correctly identify the true associations in high dimensions, crucially depend on the number of these true associations being bounded relative to the number of variables. In contrast, non-sparse methods are known not to be able to recover the true causal variables in high dimensions. This is termed the “bet on sparsity” by Hastie et al. (2009a): “Use a procedure that does well in sparse problems, since no procedure does well in dense problems”. Third, the sparsity is tuned through cross-validation and guided by the predictive performance. Therefore, we do not impose excess sparsity on the model if predictive performance indicates that this is not the right thing to do; in practice this process often does result in sparse models, either because the sparse models truly are better than dense models, due to the (unknown) sparsity of the data, or because the dense models are inferior, as their coefficients cannot be estimated well due to the high dimensionality (over-fitting).

However, having good prediction of the phenotype is not sufficient: biological interpretability is crucial as well, since once good prediction is achieved, one of the most natural questions to explore next is which of the variables are involved in the model and what their relative importance is. For example, in analysis of SNP data we would like to know which SNPs are associated with the phenotype, and how much variation is explained by each one. For this reason, lasso linear models are especially attractive, as the model weights are directly interpretable in terms of their contribution to the overall model, and the models can be made sparse. Moreover, lasso models enjoy what is sometimes called sparsistency (Kim and Xing, 2009; Zhao and Yu, 2006), meaning that under certain statistical assumptions, the truly associated variables can be recovered from the data with probability one. In practice, these assumptions cannot usually be verified in real data. However, the advantage of lasso models over the univariable approach in identifying the correct variables can be quantified empirically in simulation, where the true causal variables are known. In the real-world setting of genome-wide studies, selecting a high significance threshold generates many false positives which may lead to misleading biological interpretation. On the other hand, a stringent cutoff will discard many true associations (false negatives), leading to a small number of significant SNPs that are hard to interpret biologically.
Penalised approaches are not completely immune to these problems, but have far lower false positive rates in recovering true associations, making us more confident in the SNPs found. Ultimately, all data analysis is limited by the dataset at hand. Most currently available SNP datasets have several thousand samples, which may be enough to confidently detect most of the strong effects, but too small to enable detection of


weak effects which may exist. As datasets grow in size, there is a better chance of confidently capturing the weaker effects, thereby refining our models to include many more SNPs and potentially giving a more complete picture of the underlying biology.

Interpretability is also reduced when different studies produce different lists of prognostic genes, as has been the case with analysis of breast cancer metastasis datasets. In Chapter 4 we used a “feature engineering” approach in which expression of individual genes is aggregated to expression of gene sets, and predictive models are in turn applied to these data. The gene set approach achieved far greater stability and consistency between different studies, thus increasing the interpretability of the genes in terms of coherent cellular processes rather than disparate lists of loosely related genes.

Recovery of causal variables can be quantified using different measures, based on factors such as the class balance: the proportion of truly causal variables to the non-causal variables. As discussed in Chapter 6, when the truly causal variables are a small proportion of the total, measures such as ROC curves and areas under ROC curves (AUC) can be misleading, as high sensitivity may implicitly induce a high number of false positives as well. In such cases, we advocate the use of measures such as precision-recall curves, as they highlight potentially high false positive rates that may otherwise go unnoticed. More generally, these results also highlight the importance of choosing a suitable measure of success for each problem, as different measures emphasise different aspects of the model. Therefore, we suggest evaluating model performance using a variety of complementary measures, when possible. While the simulations were generated based on real haplotype data from HapMap, there is room to explore how well the simulations represent real case/control data, and whether these results hold in other, more extreme settings, such as when the case/control labels are more imbalanced.

Two important computational themes of this thesis have been efficiency and scalability: sophisticated models are of little practical use unless they can successfully be applied to real data. As models have become more sophisticated, so has data size grown: recent SNP microarrays approach 2 million SNPs, and sample sizes are increasing as well, with several datasets consisting of around one hundred thousand samples now available. To enjoy the benefits of models such as the lasso, efficient and scalable algorithms must be developed for fitting these models to such data.
The algorithms we have discussed in Chapters 5 and 8 are based on coordinate descent, achieving computation time that scales linearly in either the number of samples or the number of input variables, without requiring all data to be in memory at once. This makes them particularly attractive in the genome-wide setting, enabling us to rapidly analyse large datasets, as we have done for SNP data in Chapter 6 and for multi-omic data in Chapter 7.
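As a concrete illustration of the per-coordinate computation these algorithms rely on, the standard cyclic coordinate descent update for the squared-error lasso (Friedman et al., 2010) is given below; SparSNP and fmpr use related updates for their respective losses, so this is a sketch of the general structure rather than the exact update used in those implementations:

\beta_j \leftarrow \frac{S\!\left(\frac{1}{N}\sum_{i=1}^{N} x_{ij}\, r_i^{(j)},\ \lambda\right)}{\frac{1}{N}\sum_{i=1}^{N} x_{ij}^{2}},
\qquad
S(z, \lambda) = \operatorname{sign}(z)\,\max(|z| - \lambda,\ 0),

where r_i^{(j)} = y_i − Σ_{k≠j} x_{ik} β_k is the partial residual with the jth variable excluded, and S is the soft-thresholding operator. Each update touches only the jth column of the data, which is what allows the genotypes to be read from disk one SNP at a time rather than held in memory.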


Finally, analysis of large datasets itself produces large amounts of output that in turn needs to be post-processed: for example, cross-validation produces multiple models, each with its own version of the model weights. Our analysis of the DILGOM dataset generated close to 1.5 TiB of results. There is a need for better computational approaches to store and post-process these results in order to effectively and efficiently extract information from the raw data. These challenges will only increase as datasets grow larger in the number of features, and as we integrate more data types, such as in analysis of larger multi-omic datasets and of whole-genome sequencing data.

Future Work

There are several ways in which the work in this thesis can be extended. We begin by considering several extensions to the analysis of the breast cancer gene expression data. Next, we consider extensions to the sparse linear models, including the lasso and fmpr, and the algorithms for fitting these models to data.

We analysed the breast cancer datasets using non-sparse methods, such as centroid classifiers and support vector machines. There may be benefit from analysis using sparse models. However, the high degree of correlation between genes, particularly between the causal and non-causal genes, may hamper the ability of lasso models to correctly identify the causal variables (Zhao and Yu, 2006). An important challenge in this area is merging of datasets from separate studies, as individual gene expression datasets tend to be limited to several hundred samples, limiting the statistical power achieved in any one study. There may also be room to examine set statistics that take into account pathway structure, in the form of directed graphs. For example, if we consider a gene module as an information processing module (Alon, 2007) with defined inputs and outputs, we might be able to concentrate our efforts on several genes deemed to be output nodes (downstream of all other nodes in the network) and ignore internal nodes that may only mediate between genes in the module.

In the analysis of SNP data using SparSNP, we have only considered models that are additive in the allele dosage. While additivity is a common simplifying assumption in association studies, and has been justified on grounds of both genetic theory and experiments (Hill et al., 2008), other genetic configurations are known to occur, such as dominant/recessive alleles, and there may be benefit from exploring these models in real data. If there is prior knowledge about the mode of action of a SNP, then it is reasonable to represent it in the model; however, for many SNPs the mode of action is not known a priori. Exploring all possible combinations of such multivariable models is computationally infeasible (there are k^p possible combinations if k models are considered for each of the p SNPs). A simpler procedure may be to test several models for each SNP independently through a series of univariable tests, select one mode of action for each SNP, and then build a multivariable model combining the per-SNP models into one.
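As an illustration of the per-SNP encodings such a univariable comparison would consider, a minimal R sketch follows; g is a hypothetical vector of minor-allele dosages (0/1/2) for one SNP, y is a 0/1 phenotype, and the use of AIC to pick the mode of action is an arbitrary choice made for the sketch, not a recommendation from this thesis:

g.additive  <- g                     # 0, 1, 2 copies of the minor allele
g.dominant  <- as.integer(g >= 1)    # carries at least one copy
g.recessive <- as.integer(g == 2)    # carries two copies

# Fit one univariable logistic regression per coding and keep the best-fitting one.
fits <- lapply(list(additive = g.additive, dominant = g.dominant, recessive = g.recessive),
               function(x) glm(y ~ x, family = binomial))
best.coding <- names(which.min(sapply(fits, AIC)))

The chosen coding for each SNP would then be carried forward into the multivariable model.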
If non-linear effects are desired, they can be achieved through the use of kernel methods, such as polynomial kernels and Gaussian (radial basis function) kernels, at the cost of reduced interpretability, since the model weights are then expressed with respect to the samples rather than the features.


A preliminary investigation of a celiac disease dataset using SVMs with Gaussian kernels did not find any predictive advantage over the linear squared hinge loss employed by SparSNP (results not shown). However, improvements for other datasets cannot be ruled out, as the genetic architecture likely varies between diseases. In addition to genetic data, we may include other variables in the SparSNP model besides genotypes, such as age, sex, or any other clinical variable of relevance. Conceptually, these variables can be treated just like the genotypes in the model.

Another avenue we did not explore is epistasis, loosely defined as non-linear interactions between genes or SNPs. Considering epistasis in predictive models raises formidable computational, statistical, and interpretation challenges. Epistasis must first be detected in the data; however, there is a combinatorial explosion in the number of epistatic sets of SNPs that need to be examined, even for very simple forms of epistasis. For a dataset of 500,000 SNPs, there are more than 10^11 pairs of SNPs that need testing, and for triplets there are more than 2 × 10^16 sets. Clearly, examining even small epistatic sets is computationally demanding. Furthermore, even if all such pairs are tested for association with the phenotype, the multiple testing correction for the multitude of tests will be severe but necessary, as even a tiny false positive rate will incur a large number of false detections in absolute terms. Quality control of the data will also need to be especially stringent, as the effects of minor genotyping error or other confounding factors, such as cryptic relatedness or population stratification, can potentially be amplified when considering sets of SNPs together. Once putative sets of epistatic SNPs are detected, they may be used for predictive modelling, as individual SNPs are now. However, substantially larger datasets may be required to reduce the effects of overfitting induced by models with potentially very large numbers of variables. These obstacles notwithstanding, tackling epistasis, or at least simplified versions of it, may be a fruitful avenue for generating more realistic models and consequently for explaining yet more phenotypic variation.
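The pair and triplet counts quoted above follow directly from binomial coefficients; for p = 500,000 SNPs,

\binom{p}{2} = \frac{500{,}000 \times 499{,}999}{2} \approx 1.25 \times 10^{11},
\qquad
\binom{p}{3} = \frac{500{,}000 \times 499{,}999 \times 499{,}998}{6} \approx 2.08 \times 10^{16},

so even restricting attention to pairwise interactions multiplies the number of tests by roughly five orders of magnitude relative to the single-SNP analysis.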
Another important source of phenotypic variation not considered here is the environment, including gene × environment interactions. The case/control SNP datasets examined here were purposely designed to minimise confounding by environmental factors, by sampling unrelated individuals from the same ethnic populations. However, rather than removing such effects, there may be advantages in leveraging data from individuals in shared environments, as long as this confounding is properly accounted for, as has been done in pedigree studies before the advent of GWAS.

Alternative penalisation schemes to the standard lasso include the elastic net (Zou and Hastie, 2005), the adaptive lasso (Zou, 2006), and non-convex penalties such as SCAD (Fan and Li, 2001). The advantage of some non-convex penalties over the l1 penalty is that they tend to penalise the non-zero weights less, resulting in less biased estimates; in this sense they better approximate the l0 penalty.
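For reference, the elastic net mentioned above combines the two penalties in a single term; in the common parameterisation (with the mixing hyperparameter α referred to again later in this chapter),

P_{\lambda,\alpha}(\beta) = \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right), \qquad \alpha \in [0, 1],

which reduces to the lasso at α = 1 and to ridge regression at α = 0.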


However, non-convexity means that optimisation is difficult in general, since convergence to a global optimum is not guaranteed. Optimising these non-convex models typically involves generating multiple solutions from random initialisations and choosing the best solution amongst them. The lasso method, like other frequentist approaches, does not explicitly take into account prior knowledge that the researcher may have at their disposal, such as known SNP–gene associations and known mappings of genes to pathways, which could be used to influence the selection of SNPs in the model. Hierarchical Bayesian models (Gelman and Hill, 2007; Kim and Xing, 2011) can incorporate such priors, accounting for this information. In addition, Bayesian models also allow for inclusion of multiple populations in the data analysis, and enable handling of data missingness (imputation) as part of the model fitting process.

The fused penalty approach fmpr was presented in terms of squared loss for linear regression; however, it can easily be adapted to any linear or log-linear model, such as logistic regression or squared hinge loss classification, for use with binary phenotypes such as case/control status. There may be room to employ mixed l1/l2 fusion penalties in order to achieve greater flexibility in modelling, such that another hyperparameter may automatically determine whether to use an l1 or an l2 penalty in a data-dependent way, similar to the hyperparameter α in the elastic net (Zou and Hastie, 2005). In addition, instead of using the same value of the penalty for all tasks, it may be useful to have a per-task penalty λ_k, k = 1, ..., K, rather than one global penalty λ, potentially allowing for better sensitivity to the unique characteristics of each task. This will require more computation, due to the need to tune each penalty separately with cross-validation. There is also room for further research into measures of task relatedness other than Pearson correlation, and into different transformations of the correlation. Given enough data, it may be possible to reliably infer task relatedness from the data itself, perhaps using partial correlation or sparse partial correlation in the form of the graphical lasso (Friedman et al., 2007).

Both SparSNP and fmpr are fitted using coordinate descent, a simple yet effective approach. However, when analysing even larger amounts of data, such as large SNP arrays or sequencing data, standard coordinate descent may still be too slow. The simplest way to scale up coordinate descent is parallelisation: splitting the SNPs into several blocks and performing the updates concurrently, a procedure which has been shown to be feasible in the sense that the algorithm still converges (Bradley et al., 2011). This scheme leads to speedups linear in the degree of parallelism, up to a theoretical maximum (beyond which the algorithm may not converge). Such an approach would likely increase the size of datasets that can be analysed to many millions of SNPs.
Another avenue is to use specialised hardware, such as high-performance graphics processing units (GPUs), which have been used for detection of epistasis in SNP data (Kam-Thong et al., 2011). Beyond SNP data, large-scale parallelisation could potentially enable fitting statistical models to whole-genome sequencing data, when such data become routinely available for large cohorts.


A. Supplementary Results for Gene Set Statistics

A.1. Classifiers

In addition to the centroid classifier, we tested the shrunken centroid (Tibshirani et al., 2003) in the R package pamr (Hastie et al., 2009b), our implementation of the classifier from van 't Veer et al. (2002), and a support vector machine with a linear kernel (kernlab package, Karatzoglou et al., 2004). We optimised the shrunken centroid's threshold and the SVM's number of features and l2 penalty using nested random splits, where the data were randomly split into three parts: training, validation, and testing. The model was fit to the training data, and its AUC calculated for its predictions on the validation data. This was repeated over a grid of values appropriate for each model type. The optimal hyperparameters were then chosen as the ones maximising the AUC over the validation set. The model was then refit using the optimal hyperparameters on the training and validation data together, and tested on the remaining test data. Its AUC over the test data is reported. The whole procedure is repeated B times, producing B classifiers (for each classifier type), with different sets of optimal hyperparameters. The procedure is performed separately for each of the five datasets.

There are conflicting descriptions of the exact form of the classifier used in van 't Veer et al. (2002). In the original paper, it seems that the classifier classifies each sample using its Pearson correlation with each of the centroids of the positive and negative metastasis classes:

\hat{y}_i = \underset{j \in \{-1,+1\}}{\arg\max} \; \mathrm{Corr}(x_i, c_j),    (A.1)


where Corr(⋅) is the Pearson correlation, x_i is the ith sample of p genes, and c_j is the centroid of the jth class, j ∈ {−1, +1}. In other publications (Tibshirani and Efron, 2002; van de Vijver et al., 2002), the classifier is said to be based on the correlation of the sample with the positive class only, and a threshold on that correlation is used to determine which class is predicted:

\hat{y}_i =
\begin{cases}
+1 & \text{if } \mathrm{Corr}(x_i, c_1) \geq \tau; \\
-1 & \text{otherwise},
\end{cases}    (A.2)

where c_1 is the centroid for the positive class, and τ is a user-specified threshold.

We implemented both approaches, denoting them here VV1 and VV2 respectively. For the VV1 approach, we did not choose a threshold but used the correlation with the positive class as the prediction. The VV2 approach is identical to the centroid classifier used in our work when the samples have been normalised to have unit norm (McLachlan et al., 2004, pp. 202–203).

A.2. Internal Validation

Figures A.1, A.2, A.3, A.4, and A.5 show results for internal validation of the centroid classifier, the SVM, the PAM classifier (shrunken centroid), and the two van 't Veer et al. (2002) classifiers, respectively. For the centroid classifier, recursive feature elimination (RFE) was used. For the other classifiers, all features (all genes or all gene sets) were used.

A.3. External Validation

Figure A.6 shows AUC for external validation for the different models.

Significance of AUC Differences

We used ANOVA to test for differences in AUC between the set statistics, as produced by the centroid classifier. AUC is approximately normally distributed when it is not too close to zero or one. We present results for 1, 8, 64, and 4096 features in Table A.1. Taking multiple testing into account, the ANOVA shows that the AUCs for the set centroid, set median, and set t-statistic are not significantly different from the AUC for individual genes. The AUCs for the set PC and set U-statistic are significantly lower.
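A minimal R sketch of the positive-centroid score and the thresholded VV2 rule of Eq. A.2 (the two-centroid rule of Eq. A.1 additionally compares against the negative-class centroid); X.train and X.test are hypothetical samples × genes expression matrices, y.train is a vector of −1/+1 labels, and tau is the user-specified threshold:

# Centroid of the positive (metastasis) class, estimated from the training data.
c.pos <- colMeans(X.train[y.train == +1, , drop = FALSE])

# Correlation of each test sample with the positive centroid (used directly as the VV1-style score).
score <- apply(X.test, 1, cor, y = c.pos)

# VV2 (Eq. A.2): threshold the correlation to obtain a class label.
pred.vv2 <- ifelse(score >= tau, +1, -1)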


Figure A.1.: Internal validation (mean and 95% CI for AUC) for the centroid classifier with RFE. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC versus number of features for set.centroids, set.medians, set.medoids, and set.medoids2. Plot data omitted.]


Figure A.2.: Internal validation (mean and 95% CI for AUC) for the SVM classifier, using all features. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC by statistic: raw, set.centroids, set.medians, set.medoids, set.pcs, set.u.log. Plot data omitted.]


Figure A.3.: Internal validation (mean and 95% CI for AUC) for the PAM classifier. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC by statistic: raw, set.centroids, set.medians, set.medoids, set.pcs, set.u.log. Plot data omitted.]


Figure A.4.: Internal validation (mean and 95% CI for AUC) for the VV1 classifier. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC by statistic: raw, set.centroids, set.medians, set.medoids, set.pcs, set.u.log. Plot data omitted.]


Figure A.5.: Internal validation (mean and 95% CI for AUC) for the VV2 classifier. [Plot panels: (a) GSE2034, (b) GSE4922, (c) GSE6532, (d) GSE7390, (e) GSE11121; AUC by statistic: raw, set.centroids, set.medians, set.medoids, set.pcs, set.u.log. Plot data omitted.]


Figure A.6.: External validation (mean and 95% CI for AUC) for all models. [Plot panels: (a) Centroid with RFE (point estimate), (b) Centroid with RFE (point and 95% confidence interval), (c) SVM, (d) PAM, (e) VV1, (f) VV2; AUC versus number of features for the raw genes and the gene set statistics. Plot data omitted.]


Figure A.7.: Kolmogorov-Smirnov plots for overlap between the gene sets and the modules of Desmedt et al. (2008). [Plot panels: set.centroids, set.medians, set.medoids, set.pcs, set.t.stat, set.u.stat.pval.log; score versus gene set rank for the ER−/HER2−, ER+/HER2−, and HER2+ modules. Plot data omitted.]


Figure A.8.: Heatmap of selected genes in the combined dataset (932 samples), showing the three subclasses ER−/HER2−, ER+/HER2−, and HER2+. [Heatmap omitted; rows are probes for GATA3, FOXA1, ESR1, PGR, ERBB2, MKI67, HMMR, AURKA, and TOP2A.]


(a) 1 feature, ANOVA F-test p < 2.22 × 10^−16
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            0.6545     0.0116       56.63     0.0000
set.centroids         -0.0103     0.0163       -0.63     0.5301
set.medians           -0.0083     0.0163       -0.51     0.6110
set.medoids           -0.0084     0.0163       -0.52     0.6073
set.pcs               -0.0290     0.0163       -1.77     0.0784
set.t.stat             0.0199     0.0163        1.22     0.2263
set.u.stat.pval.log   -0.1615     0.0163       -9.88     0.0000

(b) 8 features, ANOVA F-test p < 2.22 × 10^−16
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            0.6836     0.0103       66.11     0.0000
set.centroids          0.0093     0.0146        0.64     0.5260
set.medians            0.0110     0.0146        0.75     0.4548
set.medoids            0.0067     0.0146        0.46     0.6452
set.pcs               -0.0684     0.0146       -4.68     0.0000
set.t.stat            -0.0035     0.0146       -0.24     0.8134
set.u.stat.pval.log   -0.1853     0.0146      -12.67     0.0000

(c) 64 features, ANOVA F-test p < 2.22 × 10^−16
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            0.6982     0.0098       71.46     0.0000
set.centroids         -0.0095     0.0138       -0.68     0.4947
set.medians           -0.0121     0.0138       -0.88     0.3813
set.medoids           -0.0122     0.0138       -0.88     0.3781
set.pcs               -0.0637     0.0138       -4.61     0.0000
set.t.stat            -0.0180     0.0138       -1.30     0.1957
set.u.stat.pval.log   -0.1910     0.0138      -13.83     0.0000

(d) 4096 features, ANOVA F-test p < 2.22 × 10^−16
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            0.6989     0.0100       70.22     0.0000
set.centroids         -0.0187     0.0141       -1.33     0.1873
set.medians           -0.0174     0.0141       -1.24     0.2187
set.medoids           -0.0212     0.0141       -1.50     0.1353
set.pcs               -0.0775     0.0141       -5.51     0.0000
set.t.stat            -0.0366     0.0141       -2.60     0.0103
set.u.stat.pval.log   -0.1799     0.0141      -12.78     0.0000

Table A.1.: ANOVA of external-validation AUC for different numbers of features. The AUC for individual genes is used as the intercept.


B. Supplementary Results for Sparse Linear Models

B.1. Scoring Measures for Causal SNP Detection

            Ŷ = 1   Ŷ = 0
True        TP      FN
False       FP      TN

Table B.1.: The confusion matrix of predicted versus actual classes. "True" denotes truly causal SNPs, "False" denotes non-causal SNPs; Ŷ = 1 and Ŷ = 0 are predictions of causal and non-causal SNPs, respectively.

Binary classification is usually evaluated through the confusion matrix, the cross-tabulation of predicted class Ŷ versus actual class, as shown in Table B.1. Common statistics derived from the confusion matrix include:

• Sensitivity, recall, true positive rate (TPR) = TP / (TP + FN)
• False positive rate (FPR) = FP / (FP + TN)
• Precision = TP / (TP + FP)

These statistics are evaluated over different cutoffs of the classifier's output.


Receiver operating characteristic (ROC) curves are induced by plotting the (TPR, FPR) pairs over all cutoff values. The ROC curve can be summarised using the area under the receiver operating characteristic curve (AUC) (Hanley and McNeil, 1982), computed through numerical integration or alternatively estimated as

\widehat{\mathrm{AUC}} = \frac{1}{N^{+} N^{-}} \sum_{i=1}^{N^{+}} \sum_{j=1}^{N^{-}} \left\{ I(\hat{y}_i > \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right\},    (B.1)

where N^+ and N^− are the numbers of cases and controls respectively, ŷ_i is the prediction for the ith case, ŷ_j is the prediction for the jth control, and I(⋅) is the indicator function, evaluating to 1 if its argument is true and to 0 otherwise. Eq. B.1 shows that the sample AUC is the maximum likelihood estimate of the probability of correctly ranking a randomly-selected causal SNP more highly than a randomly-selected non-causal SNP (with correction for ties). The expected AUC for a classifier producing random predictions is 0.5; perfect predictions have AUC = 1.0, and perfectly-wrong predictions have AUC = 0.0.

Another useful statistic is the area under the precision-recall curve (APRC, also known as average precision), which can be integrated numerically but is usually approximated as

\widehat{\mathrm{APRC}} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{Prec}_m,    (B.2)

where Prec_m is the precision at the mth level of recall, out of M levels. The expected APRC for a classifier producing random predictions is the proportion of positive samples. For estimating APRC, we used the program perf (http://kodiak.cs.cornell.edu/kddcup/software.html).

Unlike the APRC, the AUC does not depend on the relative proportions of the classes (the class balance). However, the AUC, as commonly used, can be misleading for comparing classifiers when the proportion of causal SNPs is very small, as is the case in GWA. Consider our HAPGEN simulations, where only 148 of the 73,832 SNPs are true causal SNPs. To see why AUC is not informative in these settings, consider a thought experiment similar to that used by Sonnenburg et al. (2006): we have a classifier that at some cutoff correctly classifies 100% of the true causal SNPs (TPR = 1), but also wrongly classifies 1% of the non-causal SNPs (FPR = 0.01). The AUC is the area under the curve induced by the TPR and the FPR, which is monotonically increasing; therefore, the AUC in this case must be ≥ 0.99, which seems like very good discrimination. However, when there are 73,684 non-causal SNPs, even the low false positive rate of 1% implies 0.01 × 73,684 ≈ 737 false positives on average. In comparison, even assuming a fixed recall (TPR) of 1, so that the APRC equals the precision, the number of false positives needs to be as low as 148 (the number of causal SNPs) for the precision and APRC to reach 0.5; conversely, with a false positive rate of just 0.5%, leading to ∼368 false positives on average, both the precision and the APRC drop to 148/(148 + 368) ≈ 0.287.
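A small R sketch of these two estimators, assuming a hypothetical numeric score vector s and a 0/1 label vector labels (1 = truly causal SNP); it mirrors Eq. B.1 and the mean-precision approximation of Eq. B.2, rather than the perf program used for the results in this appendix:

auc.hat <- function(s, labels) {
  pos <- s[labels == 1]
  neg <- s[labels == 0]
  # Pairwise comparisons of causal versus non-causal scores, ties counted as 1/2 (Eq. B.1).
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

avg.prec <- function(s, labels) {
  ord  <- order(s, decreasing = TRUE)
  hits <- cumsum(labels[ord])          # true positives among the top-k ranked SNPs
  prec <- hits / seq_along(hits)       # precision at each rank
  mean(prec[labels[ord] == 1])         # averaged over the recall levels (Eq. B.2)
}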


In many real-world settings, such extreme results are highly unlikely: usually, the recall/TPR is lower than 1 and the FPR is higher too, in which case the APRC is even lower. Clearly, the APRC is more sensitive to the number of false positives than the AUC, and this is an important consideration when screening for SNPs in GWA, since we would like to keep the number of false positives low: biological validation of candidates is costly and time-consuming.

B.2. Real Data

B.2.1. Checking for Stratification

We used the genomic inflation factor λ and principal component analysis (PCA) to assess population structure and cryptic relatedness in the celiac disease and WTCCC datasets.

Genomic Inflation Factors

Genomic inflation factors λ for the discovery datasets are shown in Table B.2, for both the 1-df allelic χ² test and the logistic regression genotypic test.

Dataset      λ 1-df   λ logistic
BD           1.186    1.0
CAD          1.140    1.0
Crohn        1.181    1.0
Celiac1      1.051    1.051
Celiac2-UK   1.056    1.056
HT           1.131    1.0
RA           1.111    1.0
T1D          1.120    1.0
T2D          1.150    1.0

Table B.2.: Genomic inflation factors λ estimated by PLINK v1.07 using the median of the test statistics, for either the 1-df χ² test (--assoc --adjust) or the logistic regression test (without covariates, --logistic --adjust).

Principal Component Analysis

We used smartpca from EIGENSOFT v4.0beta (Price et al., 2006) to estimate the principal components of the genotypes for each dataset. It is known that regions of high LD, such as the major histocompatibility complex (MHC) region on chr6, can produce clusters in the principal components, which can be misinterpreted as ancestral stratification (Patterson et al., 2006). We expect that population effects would show reasonably uniformly across the SNPs, whereas effects due to LD would be localised to certain regions.


Therefore, we used an LD-thinning approach similar to that of Fellay et al. (2009), namely:

1. Removed high-LD regions from the data, including chr5: 44Mb–51.5Mb, chr6: 25Mb–33.5Mb, chr8: 8Mb–12Mb, and chr11: 45Mb–57Mb.
2. Thinned the remaining SNPs using PLINK --indep-pairwise with a window size of 1500 SNPs, a step size of 150, and r² ≤ 0.2.
3. In smartpca, regressed each SNP on the previous 5 SNPs (nsnpldregress option), and removed outliers.
4. Inspected the PCA loadings of each SNP for each PC, identifying whether some regions contribute more to each PC, for both the original data and the LD-pruned data.

Any stratification remaining after this filtering is more likely to be due to true population differences than to LD effects, and if the remaining stratification is strong, it suggests that the original dataset is stratified due to population structure.

The top 5 PCs are shown in Figure B.3, for the original Celiac1 dataset (all autosomal SNPs) and for the LD-thinned version. (We examined the top 10 PCs; however, PCs beyond 5 did not show strong association with the phenotype.) There was strong stratification in the original dataset, and the top PCs were strongly predictive of case/control status (AUC ∼ 0.8); after LD-pruning, the AUC dropped to ∼0.54 (Figure B.5), indicating that the bulk of the predictive ability had been removed.

Boxplots of the PC loadings for each chromosome (Figure B.4) show that in the original Celiac1 data, chromosome 6 and chromosome 8 have vastly more SNPs with large loadings in the 1st and 2nd PCs, respectively. Other chromosomes with large contributions are chromosome 5 (PC4 and PC5) and chromosome 11 (PC4, PC6, PC9, and PC10). In comparison, the contributions in the LD-thinned data are much more uniform across the chromosomes, as expected when there is no population stratification.

B.2.2. AUC for Stringent Filtering

We prepared stringently-filtered versions of the Celiac1, Celiac2-UK, and WTCCC-T1D datasets, following the procedure in Lee et al. (2011). The datasets were filtered in PLINK by the following criteria. SNPs were removed if they had:

• MAF < 0.01 (--maf)
• missingness > 0.05 (--geno)
• deviation from Hardy-Weinberg equilibrium in controls (--hardy) with p < 0.05
• differential missingness between cases and controls (--test-missing) with p < 0.05
• two-locus test (Lee et al., 2010) with p < 0.05


Samples were removed if they had:

• missingness > 0.01 (--mind)
• relatedness π̂ > 0.05 (--genome); both samples in each such pair were removed

Post-filtering, the Celiac1 and Celiac2-UK datasets contained 2109 and 6613 samples, and 279,312 and 471,191 SNPs, respectively. For WTCCC-T1D, the stringent dataset had 4901 samples (no samples removed) and 370,280 SNPs.

AUC for cross-validation in the stringently-filtered Celiac1, Celiac2-UK, and WTCCC-T1D datasets is shown in Figure B.6. The assumed population prevalence for each disease is shown in Table B.3.

Disease       Prevalence K (%)   Source
BD            1                  (Bebbington and Ramana, 1995; Wray et al., 2010)
CAD           5.6                (Wray et al., 2010)
Celiac        1                  (van Heel and West, 2006)
Crohn's/IBD   0.1                (Carter et al., 2004; Wray et al., 2010)
HT            13.1               (NHS, 2010)
RA            0.75               (Wray et al., 2010)
T1D           0.54               (Wray et al., 2010)
T2D           3                  (Wray et al., 2010)

Table B.3.: Population prevalence for each disease as used in this work.

B.2.3. PPV/NPV

Apart from AUC, we examined several measures of classification performance: sensitivity/specificity (ROC curve), precision/recall, and PPV/NPV. Examining the individual plots can reveal anomalies in the data that are obscured when only considering aggregate statistics such as the AUC.

We examined the results for one cross-validation fold for the WTCCC-T1D dataset (Fig. B.9), the Celiac1 dataset both original and after stringent filtering (Fig. B.10 and Fig. B.11), and the Celiac2-UK dataset both original and after stringent filtering (Fig. B.12 and Fig. B.13). Most datasets had a small number of samples that were ranked highly as disease in terms of the classifier's confidence; for the squared hinge loss classifier this means high positive values of the linear predictor l. In turn, this is reflected in the plots as Specificity = 1, Precision = 1, and PPV = 1 on the left-hand side of the curves.


B.2.4. Comparison with Other Methods

We compared the lasso squared hinge loss model with two other approaches: multivariable logistic regression and GCTA (Yang et al., 2011).

The logistic regression method comprised: (1) screening of the SNPs using univariable logistic regression and selection of SNPs based on p-value; (2) SNP thinning based on LD, to reduce multicollinearity; and (3) fitting an unpenalised multivariable logistic regression to the selected SNPs. The entire logistic regression procedure was repeated within the cross-validation loop (p-values were computed only on the training data for each fold), using the same training and testing data splits as employed by the lasso models.

Compared with the 3-step logistic regression, the lasso models achieved equivalent or better AUC across all datasets, using the same number of SNPs in both models. Specifically, lasso had higher AUC in BD, Celiac1/Celiac2, Crohn, RA, and T2D, and close to equal performance in CAD, HT, and WTCCC-T1D. To evaluate model fitting, we compared the performance of the lasso to the 3-step logistic regression on subsamples of the WTCCC-T1D dataset (Figure B.15). The results indicate that the lasso-based method was more stable across a range of model sizes, whereas the 3-step logistic regression was more sensitive to the choice of model size. Two possible explanations are that the univariable screening discards potentially predictive SNPs, in contrast to the lasso, which considers all SNPs concurrently, and that the heavy penalisation inherent in the lasso approach makes it more resistant to overfitting, especially for smaller sample sizes.
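A schematic R sketch of the screening-and-refitting comparator (steps 1 and 3; the LD-thinning step 2 is omitted here for brevity); X, y, and the model size k are hypothetical placeholders, and this illustrates the procedure rather than reproducing the exact code used for these results:

# Step 1: univariable screening, one logistic regression Wald p-value per SNP.
pvals <- apply(X, 2, function(g) summary(glm(y ~ g, family = binomial))$coefficients[2, 4])

# Step 3: unpenalised multivariable logistic regression on the k most significant SNPs.
keep <- order(pvals)[1:k]
fit  <- glm(y ~ ., data = data.frame(y = y, X[, keep, drop = FALSE]), family = binomial)

Within cross-validation, both steps are applied to the training fold only, and the fitted model is then evaluated on the held-out fold.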
In addition to logistic regression, we also considered a second method for multivariable modelling, namely GCTA (Yang et al., 2011), which was used to fit a multivariable mixed-effect linear model to all autosomal SNPs in each dataset. GCTA produces estimates of the explained phenotypic variance in the data; however, estimates of variance validated on independent test data tend to be smaller than those evaluated within the same data used to derive the estimates (Makowsky et al., 2011). Therefore, we used cross-validation to produce an estimate of explained variance. In each cross-validation fold, we fitted the GCTA model to the training data, then derived GCTA's SNP score (BLUP). The SNP scores were used to estimate a per-sample risk score for independent test data, and the risk scores were used to rank the samples in the AUC calculation. We then used the AUC from the test data, together with the estimated population prevalence, to estimate the proportion of explained phenotypic variance in the test data (Table B.4). Lasso models achieved higher AUC than GCTA for three datasets: GCTA produced AUC of approximately 0.81, 0.72, and 0.65 for Celiac1/Celiac2-UK, T1D, and RA respectively, whereas the corresponding cross-validation AUC for lasso was 0.88, 0.88, and 0.74. For Crohn's, lasso produced an AUC of 0.70, while GCTA was slightly lower at AUC = 0.65. For the remainder of the datasets (BD, CAD, HT, and T2D), GCTA produced AUC similar to the lasso models. Correspondingly, GCTA models explained 19%–20% of phenotypic variance for Celiac1/Celiac2-UK, while in the other diseases no more than 10% was explained, including WTCCC-T1D.


In comparison, in cross-validation the lasso explained up to 32% of the phenotypic variance for Celiac1/Celiac2-UK (21%–38% in independent replication), and up to 28% of the phenotypic variance for WTCCC-T1D (22% in GoKinD-T1D replication).

              K       AUC                            VarExp                          N
                      Mean    95% LCL   95% UCL      Mean    95% LCL   95% UCL
BD            0.010   0.672   0.667     0.678        0.053   0.050     0.057         9
CAD           0.056   0.619   0.610     0.629        0.038   0.032     0.045         9
Celiac1       0.010   0.817   0.813     0.820        0.202   0.197     0.208         60
Celiac2-UK    0.010   0.808   0.805     0.811        0.190   0.186     0.195         30
Crohn's       0.001   0.649   0.639     0.660        0.026   0.022     0.029         9
HT            0.131   0.620   0.611     0.628        0.047   0.041     0.054         9
RA            0.008   0.648   0.639     0.657        0.036   0.032     0.041         9
WTCCC-T1D     0.005   0.720   0.716     0.723        0.078   0.075     0.081         30
T2D           0.030   0.632   0.623     0.641        0.040   0.035     0.046         9

Table B.4.: AUC and proportion of phenotypic variance explained for GCTA (Yang et al., 2011), using 3-fold cross-validation (CV). AUC was derived from the per-sample scores in the test folds for each cross-validation fold. The 95% confidence interval is from a one-sample t-test, and explained variance (including the confidence intervals) is estimated from the AUC and prevalence K using the method of Wray et al. (2010). The column denoted N is the number of AUC values estimated in cross-validation; each 3CV produces N = 3 AUC values.

RandomForest and Gradient Boosting Machines

We evaluated the case/control predictive performance of RandomForest (implemented in the R package randomForest, Liaw and Wiener, 2002) and Gradient Boosting Machines (GBM; R package gbm, Ridgeway, 1999, 2012). Due to the computational overhead of these methods in time and space, we could not run them on the entire SNP datasets. Instead, we ran them in 10 × 10-fold cross-validation on chr6 of the Celiac1 dataset (2200 samples, 19,169 SNPs). For RandomForest, we used mtry = 2 with 500 or 5000 trees. For GBM, we used logistic regression (Bernoulli loss) as the base learner, an interaction depth of 3, 500 or 5000 trees, and shrinkage of 0.001. For SparSNP, we used a lasso model with a grid of 50 penalties.

Overall, results for SparSNP were similar to those achieved by GBM and substantially better than RandomForest using 500 trees. GBM with 5000 trees had the best overall performance, with a small predictive advantage over SparSNP and a larger advantage over RandomForest with either 500 or 5000 trees (Table B.5).
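For concreteness, calls matching the settings described above might look as follows; X (a samples × SNPs matrix for chr6) and y (a 0/1 phenotype vector) are hypothetical placeholders, and this is a sketch rather than the exact scripts used:

library(randomForest)
library(gbm)

# RandomForest: mtry = 2, 500 (or 5000) trees, classification on a factor response.
rf <- randomForest(x = X, y = factor(y), mtry = 2, ntree = 500)

# GBM: Bernoulli (logistic) loss, interaction depth 3, 5000 trees, shrinkage 0.001.
gb <- gbm.fit(x = as.data.frame(X), y = y, distribution = "bernoulli",
              n.trees = 5000, interaction.depth = 3, shrinkage = 0.001)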


The predictive advantage of GBM with 5000 trees is potentially due both to the large number of trees used and to the fact that GBM took into account 2-way and 3-way interactions, whereas SparSNP and RandomForest did not.

As randomForest and gbm are R packages, they are limited in the amount of SNP data that can be analysed, since all data must be loaded into RAM. For autoimmune diseases such as celiac disease, using a small number of SNPs from chr6 is sufficient, and such models can be readily fitted. However, for other diseases such as CAD or T2D, the signal appears to be spread more widely across the genome, and analysing all SNPs would be preferable, as done by SparSNP. Apart from memory requirements, both randomForest and gbm are considerably slower than SparSNP (using the settings above), taking several hours to complete an analysis of chr6 for the models with 500 trees, compared with several minutes for SparSNP, when each method was run in parallel on the same 10-core machine.

Method                      AUC     95% LCL   95% UCL
SparSNP                     0.889   0.887     0.891
GBM (500 trees)             0.883   0.872     0.894
GBM (5000 trees)            0.895   0.883     0.907
RandomForest (500 trees)    0.805   0.775     0.834
RandomForest (5000 trees)   0.823   0.800     0.844

Table B.5.: AUC estimated in 10×10-fold cross-validation on chr6 of the Celiac1 dataset (2200 samples, 19,169 SNPs).


Figure B.1.: APRC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data. [Plot panels labelled 100, 500, 2500, 5000, 10000, 20000, 50000, 100000; APRC versus number of SNPs with non-zero effects; legend: risk ratio 1.1–2.0. Plot data omitted.]


Figure B.2.: AUC for HAPGEN simulations, using either lasso squared-hinge loss models (lasso) or the univariable logistic regression Wald test (univariable). For the lasso, different numbers of SNPs are allowed in the model, as determined by the penalty λ. For the univariable test, all SNPs are considered. For lasso, results are smoothed using LOESS over the replications. For univariable, results are averaged over the replications. The dotted vertical lines show the number of true "causal" SNPs in the data. [Plot panels labelled 100, 500, 2500, 5000, 10000, 20000, 50000, 100000; AUC versus number of SNPs with non-zero effects; legend: risk ratio 1.1–2.0. Plot data omitted.]


Figure B.3.: The first 5 principal components (PCs) of (a) the original Celiac1 data and (b) the data after removing high-LD regions, thinning, and regression on previous SNPs. The strong structure in the top PCs is largely removed by accounting for LD. PCs 6–10 were only weakly predictive of the phenotype and are not shown for clarity. [Panels: (a) Original Celiac1 data, (b) Thinned Celiac1 data. Plot data omitted.]


Figure B.5.: 10-fold cross-validated AUC for prediction of case/control status from the top 10 principal components of the Celiac1 dataset, using lasso logistic regression with glmnet (Friedman et al., 2010), selecting an increasing number of principal components (right to left): (a) the original dataset and (b) after LD-pruning. [Plots of AUC versus log(Lambda) omitted.]


Figure B.6.: LOESS-smoothed AUC (with 95% pointwise confidence intervals about the mean) for lasso models of stringently-filtered (a) Celiac1 and Celiac2-UK and (b) WTCCC-T1D, both in 30 × 3-fold cross-validation. [Plots of AUC versus number of SNPs with non-zero effects omitted.]


Figure B.7.: LOESS-smoothed AUC for models in 20×3-fold cross-validation. [Panels: (a) lasso squared-hinge loss, (b) logistic regression; AUC versus number of SNPs with non-zero effects for the BD, CAD, Celiac1, Celiac2-UK, Crohn, HT, RA, T1D, and T2D datasets. Plot data omitted.]


Figure B.8.: Averaged PPV/NPV for models in 20×3-fold cross-validation. [Panels: (a) lasso squared-hinge loss, (b) logistic regression; PPV versus NPV for the BD, CAD, Celiac1, Celiac2-UK, Crohn, HT, RA, T1D, and T2D datasets. Plot data omitted.]


Figure B.9.: Summary plots of one fold of cross-validation prediction in the WTCCC-T1D data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Panels: specificity versus sensitivity, precision versus recall, PPV versus NPV, and PPV by rank. Plot data omitted.]


Figure B.10.: Summary plots of one fold of cross-validation prediction in the Celiac1 data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Plot panels omitted; same layout as Figure B.9.]


Figure B.11.: Summary plots of one fold of cross-validation prediction in the Celiac1 data after stringent filtering. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Plot panels omitted; same layout as Figure B.9.]


Figure B.12.: Summary plots of one fold of cross-validation prediction in the Celiac2-UK data. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Plot panels omitted; same layout as Figure B.9.]


Figure B.13.: Summary plots of one fold of cross-validation prediction in the Celiac2-UK data after stringent filtering. The fourth panel shows the PPV in rank order of NPV, to better highlight the samples with PPV=1. [Plot panels omitted; same layout as Figure B.9.]


Figure B.14.: LOESS-smoothed proportion of explained phenotypic variance, over 20×3-fold cross-validation. [Panels: (a) lasso squared-hinge loss, (b) logistic regression; explained variance versus number of SNPs with non-zero effects for the BD, CAD, Celiac1, Celiac2-UK, Crohn, HT, RA, T1D, and T2D datasets. Plot data omitted.]


Figure B.15.: LOESS-smoothed AUC for the lasso squared-hinge loss classifier and logistic regression on random subsamples of the T1D data. For each prespecified size N ∈ {50, 100, 200, 400, 800, 1600, 3200}, we randomly sampled the original 4901 samples (without replacement) to form a smaller dataset. The subsampling was repeated 30 times for N = 50, 20 times for N = 100, and 10 times for the rest. Within each subsampled dataset, we ran 10 × 3CV to evaluate the AUC (for example, 30 × 10 × 3CV for N = 50). For N = 4901, we used the original dataset without sampling, running 20 × 3CV. [Plot panels by subsample size; AUC versus number of SNPs with non-zero effects, for lasso and logistic regression. Plot data omitted.]


B.2.5. Principal Component Analysis of Cases

Figure B.16.: Principal Component Analysis (PCA) of the cases only, using the top 100 SNPs identified by the lasso for the Celiac1 and Celiac2-UK datasets and their stringently-filtered versions. Samples are colored by median specificity in the cross-validation replications: median specificity ≥ 0.99 (red), and the rest (black). [Panels: (a) Celiac1, (b) Celiac1 stringent filtering, (c) Celiac2-UK, (d) Celiac2-UK stringent filtering. Plot data omitted.]


Figure B.17.: Principal Component Analysis (PCA) of the cases only, using the top 100 SNPs identified by the lasso for T1D. Samples are colored by median specificity in the cross-validation replications: median specificity ≥ 0.99 (red), and the rest (black). [Plot data omitted.]


B.3. Results for each dataset

For each disease, we show results over 20 × 3-fold cross-validation:

• AUC versus the number of SNPs with non-zero effects;
• PPV versus NPV, using the population prevalence specified for each disease;
• Explained phenotypic variance versus the number of SNPs with non-zero effects, based on the prevalence specified for each disease;
• Top SNPs selected to be in the model. The SNPs were ranked by the proportion of times they are non-zero in a model, weighted by the inverse size of the model, over all cross-validation replications (a small sketch of this ranking is given after this list).

For several datasets, we also evaluated two subsets of the data: "MHC", which is all SNPs in the major histocompatibility complex (MHC) region on chr6 (29.7Mb–33.3Mb), and "−MHC", which is all autosomal SNPs outside the MHC.
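One way to read this ranking rule as code (a hypothetical logical matrix selected, with one row per cross-validation replication and one column per SNP, TRUE where the SNP has a non-zero weight in that replication's model; assumes every replication selects at least one SNP):

model.sizes <- rowSums(selected)             # number of non-zero SNPs in each replication's model
score <- colMeans(selected / model.sizes)    # each selection contributes 1 / (model size), averaged over replications
ranking <- order(score, decreasing = TRUE)   # SNPs ranked by weighted selection proportion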


B.3.1. Bipolar Disorder (BD)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1868    2938       459,012∗   Affy SNP 5.0   No           1%

Table B.6.: BD dataset, autosomes only. Prevalence from Bebbington and Ramana (1995); Wray et al. (2010).

Figure B.18.: AUC, PPV/NPV, and explained phenotypic variance for Bipolar Disorder. [Panels: (a) AUC; (b) PPV/NPV, based on models with 1000–2000 non-zero SNPs for lasso and 200–300 for logistic regression; (c) explained phenotypic variance; (d) top 30 SNPs ranked by weighted selection proportion. Plot data omitted.]


B.3.2. Coronary Artery Disease (CAD)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1926    2938       459,012∗   Affy SNP 5.0   No           5.6%

Table B.7.: CAD dataset, autosomes only. Prevalence from Wray et al. (2010).

Figure B.19.: AUC, PPV/NPV, and explained phenotypic variance for Coronary Artery Disease. [Panels: (a) AUC; (b) PPV/NPV, based on models with 1000–2000 non-zero SNPs for lasso and 200–300 for logistic regression; (c) explained phenotypic variance; (d) top 30 SNPs ranked by weighted selection proportion. Plot data omitted.]


B.3.3. Celiac Disease (Celiac)

    Cases   Controls   SNPs       Platform               Autoimmune   Prevalence
1   778     1422       301,689∗   Illumina HumanHap300   Yes          1%
2   1849    4936       516,504∗   Illumina HumanHap550   Yes          1%

Table B.8.: Celiac datasets, autosomes only. Prevalence from van Heel and West (2006).

Figure B.20.: AUC, PPV/NPV, and explained phenotypic variance for Celiac1. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, using models with 50–100 non-zero SNPs; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


Figure B.21.: AUC, PPV/NPV, and explained phenotypic variance for Celiac2-UK. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, using models with 50–100 non-zero SNPs for lasso and 200–300 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.4. Crohn's Disease/Inflammatory Bowel Disease (Crohn's)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1748    2938       459,012∗   Affy SNP 5.0   Yes          0.1%

Table B.9.: Crohn's dataset, autosomes only. Prevalence from Carter et al. (2004); Wray et al. (2010).

Figure B.22.: AUC, PPV/NPV, and explained phenotypic variance for Crohn's. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, based on models with 500–1000 non-zero SNPs for lasso and 150–200 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.5. Hypertension (HT)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1952    2938       459,012∗   Affy SNP 5.0   No           13.1%

Table B.10.: HT dataset, autosomes only. Prevalence from NHS (2010).

Figure B.23.: AUC, PPV/NPV, and explained phenotypic variance for Hypertension. (a) AUC versus the number of SNPs with non-zero effects (All subset, lasso and logistic regression); (b) PPV/NPV, based on models with 1000–2000 non-zero SNPs for lasso and 200–300 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.6. Rheumatoid Arthritis (RA)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1860    2938       459,012∗   Affy SNP 5.0   Yes          0.75%

Table B.11.: RA dataset, autosomes only. Prevalence from Wray et al. (2010).

Figure B.24.: AUC, PPV/NPV, and explained phenotypic variance for Rheumatoid Arthritis. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, based on models with 200–300 non-zero SNPs for lasso and 50–100 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.7. Type 1 Diabetes (WTCCC-T1D)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1963    2938       459,012∗   Affy SNP 5.0   Yes          0.54%

Table B.12.: T1D dataset, autosomes only. Prevalence from Wray et al. (2010).

Figure B.25.: AUC, PPV/NPV, and explained phenotypic variance for Type 1 Diabetes. (a) AUC versus the number of SNPs with non-zero effects (All, MHC, and −MHC subsets, lasso and logistic regression); (b) PPV/NPV, based on models with 200–300 non-zero SNPs for lasso and 100–150 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


B.3.8. Type 2 Diabetes (T2D)

Cases   Controls   SNPs       Platform       Autoimmune   Prevalence
1924    2938       459,012∗   Affy SNP 5.0   No           3%

Table B.13.: T2D dataset, autosomes only. Prevalence from Wray et al. (2010).

Figure B.26.: AUC, PPV/NPV, and explained phenotypic variance for Type 2 Diabetes. (a) AUC versus the number of SNPs with non-zero effects (All subset, lasso and logistic regression); (b) PPV/NPV, based on models with 1000–2000 non-zero SNPs for lasso and 250–300 for logistic regression; (c) explained phenotypic variance versus the number of SNPs with non-zero effects; (d) top 30 SNPs, ranked by weighted selection proportion.


C. Supplementary Results for FMPR

Figure C.1.: Time to run fmpr over 50 independent replications, comparing FMPR against SPG (time in seconds). (a) Increasing samples N (100–5000), with p = 400, K = 10. (b) Increasing variables p (100–2000), with N = 100, K = 10. (c) Increasing tasks K (2–200), with N = 100, p = 100.
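As a rough illustration of how a scaling experiment of this kind could be structured, the sketch below times a fitting routine over repeated simulated datasets while N grows, with p and K held fixed, in the spirit of panel (a). It is a hedged sketch only: fit_fmpr is a stand-in ridge-style least-squares fit, not the actual FMPR or SPG implementation, and the real benchmark's solvers, penalties, and convergence settings are not reproduced here.

## Minimal R sketch (assumptions, not the thesis benchmark): average elapsed time
## of a fitting routine over `reps` replications, for increasing sample size N.

time_over_n <- function(fit_fun, n_values, p = 400, K = 10, reps = 50) {
  sapply(n_values, function(N) {
    times <- replicate(reps, {
      X <- matrix(rnorm(N * p), N, p)      # simulated predictors (e.g. genotypes)
      Y <- matrix(rnorm(N * K), N, K)      # simulated multi-task phenotypes
      system.time(fit_fun(X, Y))["elapsed"]
    })
    mean(times)                            # mean elapsed seconds at this N
  })
}

## Placeholder fitting function, used only to make the sketch runnable: an
## independent ridge-like least-squares fit for each of the K tasks.
fit_fmpr <- function(X, Y) {
  solve(crossprod(X) + diag(0.1, ncol(X))) %*% crossprod(X, Y)
}

## Small example run (reduced p and reps so it finishes quickly)
time_over_n(fit_fmpr, n_values = c(100, 250, 500), p = 50, K = 10, reps = 2)

Averaging the elapsed times per N mirrors how the curves in Figure C.1 summarise the 50 replications; the same harness could be looped over p or K for panels (b) and (c).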


Bibliography

1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467:1061–1073, 2010.

M. Ackermann and K. Strimmer. A general modular framework for gene set enrichment analysis. BMC Bioinfo., 10:47, 2009.

Affymetrix, Inc. BRLMM: an Improved Genotype Calling Method for the GeneChip Human Mapping 500K Array Set. Technical report, 2006.

A. Agresti. Categorical Data Analysis. Wiley, 2nd edition, 2002.

M. Ala-Korpela. Critical evaluation of 1H NMR metabonomics of serum as a methodology for disease risk assessment and diagnostics. Clinical Chemistry and Laboratory Medicine CCLM/FESCC, 46(1):27–42, 2008.

A. Albergaria, J. Paredes, B. Sousa, F. Milanezi, V. Carneiro, J. Bastos, S. Costa, D. Vieira, N. Lopes, E. W. Lam, N. Lunet, and F. Schmitt. Expression of FOXA1 and GATA-3 in breast cancer: the prognostic significance in hormone receptor-negative tumours. Breast Cancer Research, 11:R40, 2009.

U. Alon. An Introduction to Systems Biology. Chapman & Hall/CRC, 2007.

D. G. Altman and J. M. Bland. Diagnostic tests 2: predictive values. BMJ, 309:102, 1994.

C. Ambroise and G. J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci., 99:6562–6566, 2002.

M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene Ontology: tool for the unification of biology. Nat. Genet., 25:25–29, 2000.

J. E. Aten, T. F. Fuller, A. J. Lusis, and S. Horvath. Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Systems Biology, 2(1):34, 2008.


K. L. Ayers and H. J. Cordell. SNP Selection in Genome-Wide and Candidate Gene Studies via Penalized Logistic Regression. Genet. Epidemiol., 34:879–891, 2010.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with Sparsity-Inducing Penalties. Technical report, INRIA, 2011.

D. Baek, J. Villén, C. Shin, F. D. Camargo, S. P. Gygi, and D. P. Bartel. The impact of microRNAs on protein output. Nature, 455:64–71, 2008.

M. Bahlo, J. Stankovich, P. Danoy, P. F. Hickey, B. V. Taylor, S. R. Browning, M. A. Brown, and J. P. Rubio. Saliva-derived DNA performs well in large-scale, high-density single-nucleotide polymorphism microarray studies. Cancer Epidemiology, Biomarkers & Prevention, 19:794–8, 2010.

E. Bair and R. Tibshirani. Semi-Supervised Methods to Predict Patient Survival from Gene Expression Data. PLoS Biology, 2:0511–0522, 2004.

J. C. Barrett, S. Hansoul, D. L. Nicolae, J. H. Cho, R. H. Duerr, J. D. Rioux, S. R. Brant, M. S. Silverberg, K. D. Taylor, et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat. Genet., 40:955–962, 2008.

W. T. Barry, A. B. Nobel, and F. A. Wright. A statistical framework for testing functional categories in microarray data. Ann. Appl. Stat., 2:286–315, 2008.

D. P. Bartel. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell, 281–297, 2004.

P. Bebbington and R. Ramana. The epidemiology of bipolar affective disorder. Soc. Psychiatry Psychiatr. Epidemiol., 30:279–292, 1995.

J. Bedo, C. Sanderson, and A. Kowalczyk. An Efficient Alternative to SVM Based Recursive Feature Elimination with Applications in Natural Language Processing and Bioinformatics. In A. Sattar and B. H. Kang, editors, Proc. Aust. Joint Conf. AI, 2006.


Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J. R. Statist. Soc., 57:289–300, 1995.

A. H. Bild, G. Yao, J. T. Chang, Q. Wang, A. Potti, D. Chasse, M.-B. Joshi, D. Harpole, J. M. Lancaster, A. Berchuck, J. A. Olson, J. R. Marks, H. K. Dressman, M. West, and J. R. Nevins. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 439(7074):353–357, 2006.

H. Binder and M. Schumacher. Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples. Statist. Appl. Genet. Mol. Biol., 7, 2008.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3:993–1022, 2003.

B. J. Blencowe. Alternative Splicing: New Insights from Global Analyses. Cell, 126:37–47, 2006.

T. H. Bø, B. Dysvik, and I. Jonassen. LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acid Res., 32:e34, 2004.

W. Bodmer and C. Bonilla. Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet., 40:695–701, 2008.

B. M. Bolstad, R. A. Irizarry, M. Åstrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19:185–193, 2003.

J.-P. Bonnefont, F. Djouadi, C. Prip-Buus, S. Gobin, A. Munnich, and J. Bastin. Carnitine palmitoyltransferases 1 and 2: biochemical, molecular and medical aspects. Molecular Aspects of Medicine, 25:495–520, 2004.

L. Bottou and Y. LeCun. Large scale online learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel Coordinate Descent for L1-Regularized Loss Minimization. In ICML 2011, Proceedings of the 28th International Conference on Machine Learning, 2011.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.


H. Brentani, O. L. Caballero, A. A. Camargo, A. M. da Silva, W. A. da Silva, E. D. Neto, M. Grivet, A. Gruber, P. E. M. Guimaraes, W. Hide, C. Iseli, C. V. Jongeneel, J. Kelso, M. A. Nagai, E. P. B. Ojopi, et al. The generation and utilization of a cancer-oriented representation of the human transcriptome by using expressed sequence tags. Proc. Natl. Acad. Sci., 100:13148–13423, 2003.

S. R. Browning and B. L. Browning. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet., 81:1084–1097, 2007.

S. R. Browning and B. L. Browning. Population structure can inflate SNP-based heritability estimates. American Journal of Human Genetics, 89:191–3; author reply 193–5, 2011.

M. Buyse, S. Loi, L. van 't Veer, G. Viale, M. Delorenzi, A. M. Glas, M. S. d'Assignies, J. Bergh, R. Lidereau, P. Ellis, A. Harris, J. Bogaerts, P. Therasse, A. Floore, M. Amakrane, F. Piette, E. Rutgers, C. Sotiriou, F. Cardoso, M. J. Piccart, and T. Consortium. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J. Natl. Cancer Inst., 98:1183–1192, 2006.

C. Carlson, M. Eberle, M. Rieder, and Q. Yi. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. American Journal of Human Genetics, 74:106–20, 2004.

M. J. Carter, A. J. Lobo, S. P. Travis, and IBD Section, British Society of Gastroenterology. Guidelines for the management of inflammatory bowel disease in adults. Gut, 53 (Suppl 5):V1–V16, 2004.

K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines. J. Mach. Learn. Res., 9:1369–1398, 2008.

A. Chase, T. Ernst, A. Fiebig, A. Collins, F. Grand, P. Erben, A. Reiter, S. Schreiber, and N. C. P. Cross. TFG, a target of chromosome translocations in lymphoma and soft tissue tumors, fuses to GPR128 in healthy individuals. Haematologica, 95:20–6, 2010.

Y. Chen, J. Zhu, P.-Y. Lum, X. Yang, S. Pinto, D. J. MacNeil, C. Zhang, J. Lamb, S. Edwards, S. K. Sieberts, A. Leonardseon, L. W. Castelli, S. Wang, M.-F. Champy, B. Zhang, V. Emilsson, S. Doss, A. Ghazalpour, S. Horvath, T. A. Drake, A. J. Lusis, and E. E. Schadt. Variations in DNA elucidate molecular networks that cause disease. Nature, 452:429–435, 2008.

X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing. A smoothing proximal gradient method for general structured sparse regression. Ann. Appl. Statist., 2012. To appear.


L. Chin, J. N. Andersen, and P. A. Futreal. Cancer genomics: from discovery science to personalized medicine. Nature Medicine, 17(3):297–303, 2011.

H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer metastasis. Mol. Sys. Biol., 3, 2007.

G. M. Clarke, C. A. Anderson, F. H. Pettersson, L. R. Cardon, A. P. Morris, and K. T. Zondervan. Basic statistical analysis in genetic case-control studies. Nature Protocols, 6(2):121–33, 2011.

D. G. Clayton. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet., 5:e1000540, 2009.

D. G. Clayton. Sex chromosomes and genetic association studies. Genome Med., 1:110, 2009.

J. Cohen, P. Cohen, S. G. West, and L. S. Aiken. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, 3rd edition, 2003.

W. Cookson, L. Liang, G. Abecasis, M. Moffatt, and M. Lathrop. Mapping complex disease traits with global gene expression. Nat. Rev. Genet., 10:184–194, 2009.

H. J. Cordell. Detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet., 10:392–404, 2009.

E. Cosgun, N. A. Limdi, and C. W. Duarte. High-dimensional pharmacogenetic prediction of a continuous trait using machine learning techniques with application to warfarin dose prediction in African Americans. Bioinformatics, 27(10):1384–9, 2011.

C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, pages 1–7, 2012.

A. R. Dabney and J. D. Storey. Optimality driven nearest centroid classification from genomic data. PLoS One, 2:e1002, 2007.


H. Dai, L. van 't Veer, J. Lamb, Y. D. He, M. Mao, B. M. Fine, R. Bernards, M. van de Vijver, P. Deutsch, A. Sachs, R. Stoughton, and S. Friend. A cell proliferation signature is a marker of extremely poor outcome in a subpopulation of breast cancer patients. Cancer Res., 65:4059–4066, 2005.

I. Daubechies, M. Defrise, and C. D. Mol. An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint. Comm. Pure Appl. Math., LVII:1413–1457, 2004.

C. Desmedt, F. Piette, S. Loi, Y. Wang, F. Lallemand, B. Haibe-Kains, G. Viale, M. Delorenzi, Y. Zhang, M. S. d'Assignies, J. Bergh, R. Lidereau, P. Ellis, A. L. Harris, J. G. Klijn, J. A. Foekens, F. Cardoso, M. J. Piccart, M. Buyse, C. Sotiriou, and T. Consortium. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin. Cancer Res., 13:3207–3214, 2007.

C. Desmedt, B. Haibe-Kains, P. Wirapati, M. Buyse, D. Larsimont, G. Bontempi, M. Delorenzi, M. Piccart, and C. Sotiriou. Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clin. Cancer Res., 14:5158–5165, 2008.

B. Devlin and K. Roeder. Genomic Control for Association Studies. Biometrics, 55:997–1004, 1999.

S. Doss, E. E. Schadt, T. A. Drake, and A. J. Lusis. Cis-acting expression quantitative trait loci in mice. Genome Res., 15:681–691, 2005.

J. Downward. Targeting ras signalling pathways in cancer therapy. Nat. Rev. Cancer, 3:11–22, 2003.

P. C. A. Dubois, G. Trynka, L. Franke, K. A. Hunt, J. Romanos, A. Curtotti, A. Zhernakova, G. A. R. Heap, R. Ádány, et al. Multiple common variants for celiac disease influencing immune gene expression. Nat. Genet., 42:295–304, 2010.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML 2008, Proceedings of the 25th International Conference on Machine Learning, 2008.

J. Dupuis, C. Langenberg, I. Prokopenko, R. Saxena, N. Soranzo, A. U. Jackson, E. Wheeler, N. L. Glazer, N. Bouatia-Naji, A. L. Gloyn, et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nature Genetics, 42(2):105–16, 2010.

A. B. Dydensborg, A. A. N. Rose, B. J. Wilson, D. Grote, M. Paquet, V. Giguère, P. M. Siegel, and M. Bouchard. GATA3 inhibits breast cancer growth and pulmonary breast cancer metastasis. Oncogene, 28:2634–42, 2009.

F. Eckhardt, J. Lewin, R. Cortese, V. K. Rakyan, J. Attwood, M. Burger, J. Burton, T. V. Cox, R. Davies, T. A. Down, C. Haefliger, R. Horton, K. Howe, D. Jackson, J. Kunde, C. Koenig, J. Liddle, D. Niblett, T. Otto, R. Pettett, S. Seemann, C. Thompson, T. West, J. Rogers, A. Olek, K. Berlin, and S. Beck. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat. Genet., 38:1378–1385, 2006.

R. Edgar, M. Domrachev, and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acid Res., 30:207–210, 2002.

B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.

B. Efron and R. Tibshirani. On testing the significance of sets of genes. Annal. Stat., 1:107–129, 2007.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least Angle Regression. Ann. Stat., 32:407–499, 2004.

E. E. Eichler, J. Flint, G. Gibson, A. Kong, S. M. Leal, J. H. Moore, and J. H. Nadeau. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet., 11:446–450, 2010.

L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21:171–178, 2005.

L. Ein-Dor, O. Zuk, and E. Domany. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci., 103:5923–5928, 2006.

H. Eleftherohorinou, V. Wright, C. Hoggart, A. L. Hartikainen, et al. Pathway analysis of GWAS provides new insights into genetic susceptibility to 3 inflammatory diseases. PLoS ONE, 4:e8068, 2009.

V. Emilsson, G. Thorleifsson, B. Zhang, A. S. Leonardson, F. Zink, J. Zhu, S. Carlson, A. Helgason, G. B. Walters, S. Gunnarsdottir, M. Mouy, V. Steinthorsdottir, G. H. Eiriksdottir, G. Bjornsdottir, I. Reynisdottir, D. Gudbjartsson, A. Helgadottir, A. Jonasdottir,


Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28:337–407, 2000.

J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Statist., 1:302–332, 2007.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Soft., 33, 2010.

J. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29:1189–1232, 2001.

W. J. Fu. Penalized regressions: the bridge versus the lasso. J. Comput. Graph. Stat., 7:397–416, 1998.

A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.

G. Gibson. Rare and common variants: twenty arguments. Nat. Rev. Genet., 13(2):135–45, 2011.

Y. Gilad, S. A. Rifkin, and J. K. Pritchard. Revealing the architecture of gene regulation: the promise of eQTL studies. Trends. Genet., 24:408–415, 2008.

J. J. Goeman and P. Bühlmann. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics, 23:980–987, 2007.

J. Goeman. penalized: L1 (lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model, 2008. R package version 0.9-22.

A. D. Goldberg, C. D. Allis, and E. Bernstein. Epigenetics: A Landscape Takes Shape. Cell, 128:635–638, 2007.

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286:531–537, 1999.

H. H. H. Göring, J. E. Curran, M. P. Johnson, T. D. Dyer, J. Charlesworth, S. A. Cole, J. B. M. Jowett, L. J. Abraham, D. L. Rainwater, A. G. Comuzzie, M. C. Mahaney, L. Almasy, J. W. MacCluer, A. H. Kissebah, G. R. Collier, E. K. Moses, and J. Blangero. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat. Genet., 39:1208–1216, 2007.


V. Guerrero. Time-series analysis supported by power transformations. Journal of Forecasting, 12(1):37–48, 1993.

Z. Guo, T. Zhang, X. Li, Q. Wang, J. Xu, H. Yu, J. Zhu, H. Wang, C. Wang, E. J. Topol, Q. Wang, and S. Rao. Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinfo., 6:article 58, 2005.

I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res., 3:1157–1182, 2003.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Mach. Learn., 46:389–422, 2002.

B. Haibe-Kains, C. Desmedt, C. Sotiriou, and G. Bontempi. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics, 24:2200–2208, 2008.

J. A. Hanley and B. J. McNeil. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, 143:29–36, 1982.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2nd edition, 2009.

T. Hastie, R. Tibshirani, B. Narasimhan, and G. Chu. pamr: Pam: prediction analysis for microarrays, 2009. R package version 1.42.0.

P. W. Hedrick. Genetics of Populations. Jones & Bartlett, 4th edition, 2009.

L. Hernández, M. Pinyol, S. Hernández, S. Beà, K. Pulford, A. Rosenwald, L. Lamant, B. Falini, G. Ott, D. Y. Mason, G. Delsol, and E. Campo. TRK-fused gene (TFG) is a new partner of ALK in anaplastic large cell lymphoma producing two structurally different TFG-ALK translocations. Blood, 94:3265–8, 1999.

W. G. Hill, M. E. Goddard, and P. M. Visscher. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genetics, 4(2):e1000008, 2008.

L. A. Hindorff, P. Sethupathy, H. A. Junkins, E. M. Ramos, J. P. Mehta, F. S. Collins, and T. A. Manolio. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci., 106:9362–9367, 2009.

D. A. Hinds, L. L. Stuve, G. B. Nilsen, E. Halperin, E. Eskin, D. G. Ballinger, K. A. Frazer, and D. R. Cox. Whole-genome patterns of common DNA variation in three human populations. Science, 307(5712):1072–9, 2005.


C. J. Hoggart, J. C. Whittaker, M. D. Iorio, and D. J. Balding. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet., 4:e1000130, 2008.

M. Hollander and D. A. Wolfe. Nonparametric Statistical Methods. Wiley-Interscience, 2nd edition, 1999.

E. Holmes, I. D. Wilson, and J. K. Nicholson. Metabolic Phenotyping in Health and Disease. Cell, 134:714–717, 2008.

B. N. Howie, P. Donnelly, and J. Marchini. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genet., 5:e1000529, 2009.

B. Howie, J. Marchini, and M. Stephens. Genotype imputation with thousands of genomes. G3: Genes, Genomes, Genetics, 1(6):457–70, 2011.

V. Iacobazzi, F. Invernizzi, S. Baratta, R. Pons, W. Chung, B. Garavaglia, C. Dionisi-Vici, A. Ribes, R. Parini, M. D. Huertas, S. Roldan, G. Lauria, F. Palmieri, and F. Taroni. Molecular and functional analysis of SLC25A20 mutations causing carnitine-acylcarnitine translocase deficiency. Human Mutation, 24:312–20, 2004.

M. Inouye, J. Kettunen, P. Soininen, S. Ripatti, L. S. Kumpula, E. Hämäläinen, P. Jousilahti, A. J. Kangas, S. Männistö, M. J. Savolainen, A. Jula, J. Leiviskä, A. Palotie, V. Salomaa, M. Perola, M. Ala-Korpela, and L. Peltonen. Metabonomic, transcriptomic, and genetic variation of a population cohort. Mol. Sys. Biol., 6:441, 2010.

M. Inouye, K. Silander, E. Hamalainen, V. Salomaa, K. Harald, et al. An Immune Response Network Associated with Blood Lipid Levels. PLoS Genet., 6:e1001113, 2010.

International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature, 467:52–58, 2010.

International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299–1320, 2005.

International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449:851–861, 2007.

R. A. Irizarry, B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs, and T. P. Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acid Res., 31:e15, 2003.

R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics, 4:249–264, 2003.


A. V. Ivshina, J. George, O. Senko, B. Mow, T. C. Putti, J. Smeds, T. Lindahl, Y. Pawitan, P. Hall, H. Nordgren, J. E. Wong, E. T. Liu, J. Bergh, V. A. Kuznetsov, and L. D. Miller. Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res., 66:10292–10301, 2006.

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with Overlap and Graph Lasso. In ICML 2009, Proceedings of the 26th International Conference on Machine Learning, 2009.

R. C. Jansen and J.-P. Nap. Genetical genomics: the added value from segregation. Trends Genet., 17:388–393, 2001.

T. Kam-Thong, B. Pütz, N. Karbalai, B. Müller-Myhsok, and K. Borgwardt. Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs. Bioinformatics, 27(13):i214–i221, 2011.

M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acid Res., 28:27–30, 2000.

M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe, and M. Hirakawa. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucl. Acid. Res., 38:D355–D360, 2010.

A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004.

S.-Y. Kim and Y.-S. Kim. A gene sets approach for identifying prognostic gene signatures for outcome prediction. BMC Genomics, 9, 2008.

S. Kim and E. P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet., 5:e1000587, 2009.

S. Kim and E. P. Xing. Exploiting Genome Structure in Association Analysis. J. Comput. Biol., 18:1–16, 2011.

H. Kim, G. H. Golub, and H. Park. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21:187–98, 2005.

P.-M. Kloetzel. The proteasome and MHC class I antigen processing. Biochimica et Biophysica Acta, 1695:225–233, 2004.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. Ann. Stat., 28:1356–1378, 2000.

C. Kooperberg, M. Leblanc, and V. Obenchain. Risk prediction using genome-wide association studies. Genet. Epidemiol., 34:643–652, 2010.


C. M. Lindgren, I. M. Heid, J. C. Randall, C. Lamina, V. Steinthorsdottir, L. Qi, E. K. Speliotes, G. Thorleifsson, C. J. Willer, et al. Genome-wide association scan meta-analysis identifies three loci influencing adiposity and fat distribution. PLoS Genet., 5:e1000508, 2009.

B. Liu, L. Liu, A. Tsykin, G. Goodall, J. Green, M. Zhu, C. Kim, and J. Li. Identifying functional miRNA-mRNA regulatory modules with correspondence latent dirichlet allocation. Bioinformatics, 26:3105–3111, 2010.

B. A. Logsdon, G. E. Hoffman, and J. G. Mezey. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinfo., 11:58, 2010.

K. E. Lohmueller, C. L. Pearce, M. Pike, E. S. Lander, and J. N. Hirschhorn. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat. Genet., 33:177–182, 2003.

S. Loi, B. Haibe-Kains, C. Desmedt, F. Lallemand, A. M. Tutt, C. Gillet, P. Ellis, A. Harris, J. Bergh, J. A. Foekens, J. G. M. Klijn, D. Larsimont, M. Buyse, G. Bontempi, M. Delorenzi, M. J. Piccart, and C. Sotiriou. Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J. Clin. Oncol., 25:1239–1246, 2007.


T. A. Mckinsey, K. Kuwahara, S. Bezprozvannaya, and E. N. Olson. Responsiveness to the Ankyrin-Repeat Proteins ANKRA2 and RFXANK. Molecular Biology of the Cell, 17(January):438–447, 2006.

G. J. McLachlan, K.-A. Do, and C. Ambroise. Analyzing Microarray Gene Expression Data. Wiley Interscience, 2004.

D. W. Mehlman, U. L. Sheperd, and D. A. Kelt. Bootstrapping Principal Components Analysis: A Comment. Ecology, 76:640–643, 1995.

L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. J. R. Statist. Soc. B, 70:53–71, 2008.

N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the lasso. Annal. Stat., 34:1436–1462, 2006.

N. Meinshausen and P. Bühlmann. Stability selection. J. Royal Soc. Stats. B, 72:417–473, 2010.

I. Mérida, A. Avila-Flores, and E. Merino. Diacylglycerol kinases: at the hub of cell signalling. The Biochemical Journal, 409(1):1–18, 2008.

S. Michiels, S. Koscielny, and C. Hill. Prediction of cancer outcome with microarrays: a multiple random validation study. The Lancet, 365:488–492, 2005.

C. Miranda, E. Roccato, G. Raho, S. Pagliardini, M. A. Pierotti, and A. Greco. The TFG Protein, Involved in Oncogenic Rearrangements, Interacts With TANK and NEMO, Two Proteins Involved in the NF-kB Pathway. Journal of Cellular Physiology, 160:154–160, 2006.

B. Modrek and C. Lee. A genomic view of alternative splicing. Nat. Genet., 30:13–19, 2002.

J. H. Moore and S. M. Williams. Epistasis and Its Implications for Personal Genetics. Am. J. Hum. Genet., 85:309–320, 2009.

J. H. Moore. A global view of epistasis. Nat. Genet., 37:13–14, 2005.

V. K. Mootha, C. M. Lindgren, K.-F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstråle, E. Laurila, N. Houstis, M. J. Daly, N. Patterson, J. P. Mesirov, T. R. Golub, P. Tamayo, B. Spiegelman, E. S. Lander, J. N. Hirschhorn, D. Altshuler, and L. C. Groop. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34:267–273, 2003.

M. Morley, C. M. Molony, T. M. Weber, J. L. Devlin, K. G. Ewens, R. S. Spielman, and V. G. Cheung. Genetic analysis of genome-wide variation in human gene expression. Nature, 430:743–747, 2004.


J. D. Mosley and R. A. Keri. Cell cycle correlated genes dictate the prognostic power of breast cancer gene lists. BMC Med. Genom., 1:11, 2008.

P. W. Mueller, J. J. Rogus, P. A. Cleary, Y. Zhao, A. M. Smiles, M. W. Steffes, et al. Genetics of Kidneys in Diabetes (GoKinD) study: a genetics collection available for identifying genetic susceptibility factors for diabetic nephropathy in type 1 diabetes. J. Am. Soc. Nephrol., 17:1782–1790, 2006.

S. Myers, L. Bottolo, C. Freeman, G. McVean, and P. Donnelly. A fine-scale map of recombination rates and hotspots across the human genome. Science, 310:321–324, 2005.

NHS. Clinical and Health Outcomes Knowledge Base Financial Year 2008–2009, 2010. Accessed March 3rd, 2011.

J. K. Nicholson and J. C. Lindon. Metabonomics. Nature, 455:1054–1056, 2008.

L. Nisticó, C. Fagnani, I. Coto, S. Percopo, R. Cotichini, M. G. Limongelli, F. Paparo, S. D'Alfonso, M. Giordano, C. Sferlazzas, G. Magazzú, P. Momigliano-Richiardi, L. Greco, and M. A. Stazi. Concordance, disease progression, and heritability of coeliac disease in Italian twins. Gut, 55:803–808, 2006.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.

J. O. Ogutu, H.-P. Piepho, and T. Schulz-Streeck. A comparison of random forests, boosting and support vector machines for genomic selection. BMC Proceedings, 5 Suppl 3(Suppl 3):S11, 2011.

K. L. Ong, B. M. Y. Cheung, Y. B. Man, C. P. Lau, and K. S. L. Lam. Prevalence, Awareness, Treatment, and Control of Hypertension Among United States Adults 1999–2004. Hypertension, 49:69–75, 2007.

J. F. Oram and R. M. Lawn. ABCA1: the gatekeeper for eliminating excess tissue cholesterol. Journal of Lipid Research, 42:1173–1179, 2001.

M. Osborne, B. Presnell, and B. Turlach. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20:389–404, 2000.

M. R. Osborne, B. Presnell, and B. Turlach. On the lasso and its dual. J. Comput. Graph. Stat., 9:319–337, 2000.

S. Pagant and E. Miller. Transforming ER exit: protein secretion meets oncogenesis. Nature Cell Biology, 13:525–6, 2011.

N. Patterson, A. L. Price, and D. Reich. Population Structure and Eigenanalysis. PLoS Genet., 2:e190, 2006.


L. Pusztai, C. Mazouni, K. Anderson, Y. Wu, and W. F. Symmans. Molecular classification of breast cancer: limitations and potential. The Oncologist, 11:868–877, 2006.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0.

N. Rabbee and T. P. Speed. A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics, 22:7–12, 2006.

S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C. H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. P. Mesirov, T. Poggio, W. Gerald, M. Loda, E. S. Lander, and T. R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci., 98:15149–15154, 2001.

J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, 2nd edition, 2006.

D. E. Reich and E. S. Lander. On the allelic spectrum of human disease. Trends Genet., 17:502–510, 2001.

G. Ridgeway. The state of boosting. Computing Science and Statistics, 31:172–181, 1999.

G. Ridgeway. gbm: Generalized Boosted Regression Models, 2012. R package version 1.6-3.2.

M. V. Rockman. Reverse engineering the genotype-phenotype map with natural genetic variation. Nature, 456:738–744, 2008.

U. Roshan, S. Chikkagoudar, Z. Wei, K. Wang, and H. Hakonarson. Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forest. Nucl. Acids Res., 39:e62, 2011.

R. Savage, Z. Ghahramani, J. Griffin, B. D. L. Cruz, and D. Wild. Discovering transcriptional modules by Bayesian data integration. Bioinformatics, 26:i158–i167, 2010.

E. E. Schadt, J. Lamb, X. Yang, J. Zhu, S. Edwards, D. Guhathakurta, S. K. Sieberts, S. Monks, M. Reitman, C. Zhang, P. Y. Lum, A. Leonardson, R. Thieringer, J. M. Metzger, L. Yang, J. Castle, H. Zhu, S. F. Kash, T. A. Drake, A. Sachs, and A. J. Lusis. An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet., 17:710–717, 2005.

R. Schilsky. Personalized medicine in oncology: the future is now. Nature Reviews Drug Discovery, 9:363–366, 2010.

M. Schmidt, D. Böhm, C. von Törne, E. Steiner, A. Puhl, H. Pilch, H.-A. Lehr, J. G. Hengstler, J. Kölbl, and M. Gehrmann. The Humoral Immune System Has a Key Prognostic Impact in Node-Negative Breast Cancer. Cancer Res., 68:5405–5413, 2008.


E. Schneider, M. Rolli-Derkinderen, M. Arock, and M. Dy. Trends in histamine research: new functions during immune responses and hematopoiesis. Trends in Immunology, 23(5):255–263, 2002.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

N. J. Schork, S. S. Murray, K. A. Frazer, and E. J. Topol. Common vs. rare allele hypotheses for complex disease. Curr. Opin. Genet. Dev., 19:212–219, 2009.

V. Scotet, M. P. Audrézet, M. Roussey, G. Rault, M. Blayau, M. D. Braekeleer, and C. Férec. Impact of public health strategies on the birth prevalence of cystic fibrosis in Brittany, France. Hum. Genet., 113:280–285, 2003.

E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet., 34:166–176, 2003.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In ICML 2009, Proceedings of the 26th International Conference on Machine Learning, volume 26, 2009.

J. Shendure. The beginning of the end for microarrays? Nat. Meth., 5:585–587, 2008.

N. Shimizu, S. Noda, K. Katayama, H. Ichikawa, H. Kodama, and H. Miyoshi. Identification of genes potentially involved in supporting hematopoietic stem cell activity of stromal cell line MC3T3-G2/PA6. International Journal of Hematology, 87:239–45, 2008.

T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer. ROCR: visualizing classifier performance in R. Bioinformatics, 21:3940–3941, 2005.

G. K. Smyth and T. Speed. Normalization of cDNA Microarray Data. Methods, 31:266–273, 2003.

G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statist. Appl. Genet. Mol. Biol., 3, 2004.

G. K. Smyth. Limma: linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit, and W. H. R. Irizarry, editors, Bioinformatics and Computational Biology Solutions using R and Bioconductor, pages 397–420. Springer, New York, 2005.

S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: accurate recognition of transcription starts in human. Bioinformatics, 22:e472–e480, 2006.

T. Sørlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, T. Thorsen, H. Quist, J. C. Matese, P. O. Brown, D. Botstein, P. E. Lønning, and A. L. Børresen-Dale. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci., 98:10869–10874, 2001.

T. Sørlie, R. Tibshirani, J. Parker, T. Hastie, J. S. Marron, A. Nobel, S. Deng, H. Johnsen, R. Pesich, S. Geisler, J. Demeter, C. M. Perou, P. E. Lønning, P. O. Brown, A.-L. Børresen-Dale, and D. Botstein. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci., 100:8418–8423, 2003.

C. Sotiriou and M. J. Piccart. Taking gene-expression profiling to the clinic: when will molecular signatures become relevant to patient care? Nat. Rev. Cancer, 7:545–553, 2007.

C. Sotiriou and L. Pusztai. Gene-expression signatures in breast cancer. N. Eng. J. Med., 360:790–800, 2009.

C. Sotiriou, P. Wirapati, S. Loi, A. Harris, S. Fox, J. Smeds, H. Nordgren, P. Farmer, V. Praz, B. Haibe-Kains, C. Desmedt, D. Larsimont, F. Cardoso, H. Peterse, D. Nuyten, M. Buyse, M. J. V. de Vijver, J. Bergh, M. Piccart, and M. Delorenzi. Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. J. Natl. Cancer Inst., 98:262–272, 2006.

S. Soumian, C. Albrecht, A. Davies, and R. Gibbs. ABCA1 and atherosclerosis. Vascular Medicine, 10(2):109–120, May 2005.

F. J. Staal, M. van der Burg, L. F. Wessels, B. H. Barendregt, M. R. Baert, C. M. van den Burg, C. van Huffel, A. W. Langerak, V. H. van der Velden, M. J. Reinders, and J. J. van Dongen. DNA microarrays for comparison of gene expression profiles between diagnosis and relapse in precursor-B acute lymphoblastic leukemia: choice of technique and purification influence the identification of potential diagnostic markers. Leukemia, 17:1324–1332, 2003.

C. Staiger, S. Cadot, R. Kooter, M. Dittrich, T. Mueller, G. W. Klau, and L. F. A. Wessels. A critical evaluation of network and pathway based classifiers for outcome prediction in breast cancer. arXiv:1110.3717v2, 2011.

J. D. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci., 100:9440–9445, 2003.

B. E. Stranger, M. S. Forrest, A. G. Clark, M. J. Minichiello, S. Deutsch, R. Lyle, S. Hunt, B. Kahl, S. E. Antonarakis, S. Tavaré, P. Deloukas, and E. T. Dermitzakis. Genome-wide associations of gene expression variation in humans. PLoS Genetics, 1:e78, 2005.

B. E. Stranger, A. C. Nica, M. S. Forrest, A. Dimas, C. P. Bird, C. Beazley, C. E. Ingle, M. Dunning, P. Flicek, D. Koller, S. Montgomery, S. Tavaré, P. Deloukas, and E. T.


BibliographyThe Wellcome Trust Case Control Consortium. Genome-<strong>wide</strong> association study <strong>of</strong> 14,000cases <strong>of</strong> seven common diseases and 3,000 shared controls. Nature, 447:661–678, 2007.L. Tian, S. A. Greenberg, S. W. Kong, J. Altschuler, I. S. Kohane, and P. J. Park. Discoveringstatistically significant pathways in expression pr<strong>of</strong>iling studies. Proc. Natl. Acad. Sci.,102:13544–13549, 2005.R. J. Tibshirani and B. Efron. Pre-validation and inference in microarrays. Statist. Appl.Genet. Mol. Biol., 1:1, 2002.R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Class Prediction by Nearest ShrunkenCentroids, with Applications to DNA Microarrays. Stat. Sci., 18:104–117, 2003.R. Tibshirani. Regression Shrinkage and Selection via the Lasso. J. R. Statist. Soc. B,58:267–288, 1996.P. Törönen, P. J. Ojala, P. Maartinen, and L. Holm. Robust extraction <strong>of</strong> functional signalsfrom gene set <strong>analysis</strong> using a generalized threshold free scoring function. BMC Bioinfo.,10:307, 2009.G. Trynka, K. A. Hunt, N. A. Bockett, J. Romanos, V. Mistry, A. Szperl, S. F. Bakker, M. T.Bardella, L. Bhaw-Rosun, et al. Dense genotyping identifies and localizes multiple commonand rare variant association signals in celiac disease. Nat. Genet., 2011. advance onlinepublication.P. Tseng. Convergence <strong>of</strong> a Block Coordinate Descent Method <strong>for</strong> Nondifferentiable Minimization.J. Opt. Theory Appl., 109:475–494, 2001.T. Tukiainen, J. Kettunen, A. J. Kangas, L.-P. Lyytikåinen, et al. Detailed metabolic andgenetic characterization reveals new associations <strong>for</strong> 30 known lipid loci. Hum. Mol. Genet.,2011. To appear.J. Y. Uriu-Adams and C. L. Keen. Copper, oxidative stress, and <strong>human</strong> health. Molecularaspects <strong>of</strong> medicine, 26:268–98, 2005.M. J. van de Vijver, Y. D. He, L. J. van ’t Veer, H. Dai, A. A. M. Hart, D. W. Voskuil, G. J.Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen,A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H.Friend, and R. Bernards. A gene-expression signature as a predictor <strong>of</strong> survival in breastcancer. New Engl. J. Med., 347:1999–2009, 2002.A. J. Van der Kooij. Prediction Accuracy and Stability <strong>of</strong> Regression with Optimal ScalingTrans<strong>for</strong>mations. PhD thesis, Faculty <strong>of</strong> Social and Behavioural Sciences, Leiden University,2007.278


BibliographyP. J. van Diest, E. van der Wall, and J. P. A. Baak. Prognostic value <strong>of</strong> proliferation ininvasive breast cancer: a review. J. Clin. Pathol., 57:675–681, 2004.D. A. van Heel and J. West. Recent advances in coeliac disease. Gut, 55:1037–1046, 2006.D. A. van Heel, L. Franke, K. A. Hunt, R. Gwilliam, et al. A <strong>genome</strong>-<strong>wide</strong> association study<strong>for</strong> celiac disease identifies risk variants in the region harboring il2 and il21. Nat. Genet.,39:827–829, 2007.L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L.Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven,C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression pr<strong>of</strong>iling predictedclinical outcome <strong>of</strong> breast cancer. Nature, 415:530–536, 2002.M. H. van Vliet, C. N. Klijn, L. F. A. Wessels, and M. J. T. Reinders. Module-Based OutcomePrediction Using Breast Cancer Compendia. PLoS ONE, 2, 2007.D. Venet, J. E. Dumont, and V. Detours. Most Random Gene Expression Signatures AreSignificantly Associated with Breast Cancer Outcome. PLoS Comp. Biol., 7:e1002240,2011.J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith,M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H.Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian,P. D. Thomas, J. Zhang, G. L. G. Miklos, C. Nelson, S. Broder, A. G. Clark,J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon, C. Slayman,M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea,A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington,J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran,R. Charlab, K. Chaturvedi, Z. Deng, V. D. Francesco, P. Dunn, K. Eilbeck,C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J.Heiman, M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang,X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K. Naik, V. A. Narayan,B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Wang,A. Wang, X. Wang, J. Wang, M. Wei, R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye, M. Zhan,W. Zhang, H. Zhang, Q. Zhao, L. Zheng, F. Zhong, W. Zhong, S. Zhu, S. Zhao, D. Gilbert,S. Baumhueter, G. Spier, C. Carter, A. Cravchik, T. Woodage, F. Ali, H. An, A. Awe,D. Baldwin, H. Baden, M. Barnstead, I. Barrow, K. Beeson, D. Busam, A. Carver, A. Center,M. L. Cheng, L. Curry, S. Danaher, L. Davenport, R. Desilets, S. Dietz, K. Dodson,L. Doup, S. Ferriera, N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C. Haynes, C. Heiner,S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam, J. Johnson, F. Kalush, L. Kline,279


BibliographyS. Koduru, A. Love, F. Mann, D. May, S. McCawley, T. McIntosh, I. McMullen, M. Moy,L. Moy, B. Murphy, K. Nelson, C. Pfannkoch, E. Pratts, V. Puri, H. Qureshi, M. Reardon,R. Rodriguez, Y. H. Rogers, D. Romblad, B. Ruhfel, R. Scott, C. Sitter, M. Smallwood,E. Stewart, R. Strong, E. Suh, R. Thomas, N. N. Tint, S. Tse, C. Vech, G. Wang, J. Wetter,S. Williams, M. Williams, S. Windsor, E. Winn-Deen, K. Wolfe, J. Zaveri, K. Zaveri,J. F. Abril, R. Guig, M. J. Campbell, K. V. Sjolander, B. Karlak, A. Kejariwal, H. Mi,B. Lazareva, T. Hatton, A. Narechania, K. Diemer, A. Muruganujan, N. Guo, S. Sato,V. Bafna, S. Istrail, R. Lippert, R. Schwartz, B. Walenz, S. Yooseph, D. Allen, A. Basu,J. Baxendale, L. Blick, M. Caminha, J. Carnes-Stine, P. Caulk, Y. H. Chiang, M. Coyne,C. Dahlke, A. Mays, M. Dombroski, M. Donnelly, D. Ely, S. Esparham, C. Fosler, H. Gire,S. Glanowski, K. Glasser, A. Glodek, M. Gorokhov, K. Graham, B. Gropman, M. Harris,J. Heil, S. Henderson, J. Hoover, D. Jennings, C. Jordan, J. Jordan, J. Kasha, L. Kagan,C. Kraft, A. Levitsky, M. Lewis, X. Liu, J. Lopez, D. Ma, W. Majoros, J. McDaniel,S. Murphy, M. Newman, T. Nguyen, N. Nguyen, M. Nodell, S. Pan, J. Peck, M. Peterson,W. Rowe, R. Sanders, J. Scott, M. Simpson, T. Smith, A. Sprague, T. Stockwell,R. Turner, E. Venter, M. Wang, M. Wen, D. Wu, M. Wu, A. Xia, A. Zandieh, and X. Zhu.The sequence <strong>of</strong> the <strong>human</strong> <strong>genome</strong>. Science, 291:1304–1351, 2001.J.-B. Veyrieras, S. Kudaravalli, S.-Y. Kim, E. T. Dermitzakis, Y. Gilad, M. Stephens, andJ. K. Pritchard. High-Resolution Mapping <strong>of</strong> Expression-QTLs Yields Insight into HumanGene Regulation. PLoS Genet., 4:e1000214, 2008.P. M. Visscher, W. G. Hill, and N. R. Wray. Heritability in the genomics era–concepts andmisconceptions. Nat. Rev. Genet., 9:255–66, 2008.Y.-H. Wang and T. P. Speed. Design and <strong>analysis</strong> <strong>of</strong> comparative microarray experiments. InT. P. Speed, editor, Statistical Analysis <strong>of</strong> Gene Expression Microarray Data. CRC Press,2003.Y. Wang, J. G. Klijn, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov,M. Timmermans, M. M. van Gelder, J. Yu, T. Jatkoe, E. M. Berns, D. Atkins, and J. A.Foekens. Gene-expression pr<strong>of</strong>iles to predict distant metastasis <strong>of</strong> lymph-node-negativeprimary breast cancer. The Lancet, 365:671–679, 2005.Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool <strong>for</strong> transcriptomics.Nat. Rev. Genet., 10:57–63, 2009.T. J. Wang, M. G. Larson, R. S. Vasan, S. Cheng, E. P. Rhee, E. McCabe, G. D. Lewis, C. S.Fox, P. F. Jacques, C. Fernandez, C. J. O’Donnell, S. a. Carr, V. K. Mootha, J. C. Florez,A. Souza, O. Melander, C. B. Clish, and R. E. Gerszten. Metabolite pr<strong>of</strong>iles and the risk<strong>of</strong> developing diabetes. Nat. Med., 17:448–453, 2011.280


BibliographyZ. Wei and H. Li. Nonparametric pathway-based regression models <strong>for</strong> <strong>analysis</strong> <strong>of</strong> genomicdata. Biostatistics, 8(2):265–84, 2007.Z. Wei, K. Wang, H. Q. Qu, H. Zhang, J. Bradfield, C. Kim, E. Frackleton, et al. Fromdisease association to risk assessment: an optimistic view from <strong>genome</strong>-<strong>wide</strong> associationstudies on type 1 diabetes. PLoS Genet., 5:e1000678, 2009.S. Weidinger, C. Gieger, E. Rodriguez, H. Baurecht, M. Mempel, N. Klopp, H. Gohlke, S. Wagenpfeil,M. Ollert, J. Ring, H. Behrendt, J. Heinrich, N. Novak, T. Bieber, U. Krämer,D. Berdel, A. von Berg, C. P. Bauer, O. Herbarth, S. Koletzko, H. Prokisch, D. Mehta,T. Meitinger, M. Depner, E. von Mutius, L. Liang, M. M<strong>of</strong>fatt, W. Cookson, M. Kabesch,H.-E. Wichmann, and T. Illig. Genome-<strong>wide</strong> scan on total serum IgE levels identifiesFCER1A as novel susceptibility locus. PLoS genetics, 4(8):e1000166, 2008.B. Weigelt, J. L. Peterse, and L. J. van ’t Veer. Breast cancer metastasis: markers and models.Nat. Rev. Cancer, 5:591–602, 2005.Wellcome Trust Case Control Consortium. Genome-<strong>wide</strong> association study <strong>of</strong> CNVs in 16,000cases <strong>of</strong> eight common diseases and 3,000 shared controls. Nature, 464:713–720, 2010.J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use <strong>of</strong> the zero-norm with linearmodels and kernel methods. J. Mach. Learn. Res., 3:1439–1461, 2003.J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, 1990.P. Wirapati, C. Sotiriou, P. Farmer, S. Pradervand, B. Haibe-Kains, C. Desmedt, M. Ignatiadis,T. Sengstag, F. Schutz, D. Goldstein, M. Piccart, and M. Delorenzi. Meta-<strong>analysis</strong><strong>of</strong> gene expression pr<strong>of</strong>iles in breast cancer: toward a unified understanding <strong>of</strong> breast cancersubtyping and prognosis signatures. Breast Cancer Res., 10:R65, 2008.B. S. Wittner, D. C. Sgroi, P. D. Ryan, T. J. Bruinsma, A. M. Glas, A. Male, S. Dahiya,K. Habin, R. Bernards, D. A. Haber, L. J. V. Veer, and S. Ramaswamy. Analysis <strong>of</strong> themammaprint breast cancer assay in a predominantly postmenopausal cohort. Clin. CancerRes., 14:2988–2993, 2008.N. R. Wray, J. Yang, M. E. Goddard, and P. M. Visscher. The Genetic Interpretation <strong>of</strong> Areaunder the ROC Curve in Genomic Pr<strong>of</strong>iling. PLoS Genet., 6:e1000864, 2010.T.-T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange. Genome-<strong>wide</strong> association <strong>analysis</strong>by lasso penalized logistic regression. Bioin<strong>for</strong>matics, 25:714–721, 2009.X. Xie, J. Lu, E. Kulbokas, T. Golub, V. Mootha, K. Lindblad-Toh, E. Lander, and M. Kellis.Systematic discovery <strong>of</strong> regulatory motifs in <strong>human</strong> promoters and 3’ UTRs by comparison<strong>of</strong> several mammals. Nature, 434:338–345, 2005.281


BibliographyC. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistaticinteractions from large-scale SNP data via adaptive group Lasso. BMC Bioinfo., 11 (Suppl1):S18, 2010.J. Yang, S. H. Lee, M. E. Goddard, and P. M. Visscher. GCTA: A Tool <strong>for</strong> Genome-<strong>wide</strong>Complex Trait Analysis. Am. J. Hum. Genet., 88:76–82, 2011.M. Yousef, S. Jung, L. C. Showe, and M. K. Showe. Recursive cluster elimination (RCE) <strong>for</strong>classification and feature selection from gene expression data. BMC Bioinfo., 8:article 144,2007.H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large linear classification when datacannot fit in memory. In 16th ACM KDD, 2010.E. Zeggini, L. J. Scott, R. Saxena, B. F. Voight, and the DIAGRAM Consortium. Meta<strong>analysis</strong><strong>of</strong> <strong>genome</strong>-<strong>wide</strong> association data and large-scale replication identifies susceptibilityloci <strong>for</strong> type 2 diabetes. Nat. Genet., 40:638–645, 2008.P. Zhao and B. Yu. On model selection consistency <strong>of</strong> lasso. J. Mach. Learn. Res., 7:2541–2563, 2006.J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. J. Comput.Graph. Stat., 14:185–205, 2005.H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. R. Statist.Soc. B, 67:301–320, 2005.H. Zou. The adaptive lasso and its oracle properties. J. Amer. Stat. Assoc., 101:1418–1429,2006.282
