
Statistics for high-dimensional data: Introduction, and the Lasso for linear models

Peter Bühlmann and Sara van de Geer
Seminar für Statistik, ETH Zürich
May 2012


High-dimensional data

Riboflavin production with Bacillus Subtilis (in collaboration with DSM (Switzerland))

goal: improve riboflavin production rate of Bacillus Subtilis using clever genetic engineering

response variable Y ∈ R: riboflavin (log-) production rate
covariates X ∈ R^p: expressions from p = 4088 genes
sample size n = 115, p ≫ n

[Figure: gene expression data — scatterplots of Y versus 9 "reasonable" genes]


general framework:
Z_1, ..., Z_n (with some "i.i.d. components"), dim(Z_i) ≫ n

for example:
◮ Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ Y ⊆ R: regression with p ≫ n
◮ Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ {0, 1}: classification for p ≫ n

numerous applications: biology, imaging, economy, environmental sciences, ...


High-dimensional linear models

Y_i = ∑_{j=1}^p β^0_j X^(j)_i + ε_i,  i = 1, ..., n,  with p ≫ n
in short: Y = Xβ^0 + ε (see the simulation sketch below)

goals:
◮ prediction, e.g. w.r.t. squared prediction error
◮ estimation of β^0, e.g. w.r.t. ‖β̂ − β^0‖_q (q = 1, 2)
◮ variable selection, i.e. estimating the active set of the effective variables (those with coefficient ≠ 0)
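This setup is easy to mimic on a computer. A minimal R sketch (the sizes n = 100, p = 1000, s_0 = 5 are hypothetical choices, not from the slides) generates data from Y = Xβ^0 + ε with a sparse β^0:

set.seed(1)
n <- 100; p <- 1000; s0 <- 5            # p >> n, only s0 active variables
X <- matrix(rnorm(n * p), nrow = n)     # design matrix
beta0 <- c(rep(2, s0), rep(0, p - s0))  # sparse true coefficient vector
Y <- as.vector(X %*% beta0 + rnorm(n))  # response: Y = X beta0 + noise
str(list(n = n, p = p, s0 = s0))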


we need to regularize... and there are many proposals:
◮ Bayesian methods for regularization
◮ greedy algorithms: aka forward selection or boosting
◮ preliminary dimension reduction
◮ ...

e.g. 3'190'000 entries on Google Scholar for "high dimensional linear model" ...




Penalty-based methods

if the true β^0 is sparse w.r.t.
◮ ‖β^0‖_0 = number of non-zero coefficients
  ❀ regularize with the ‖·‖_0-penalty: argmin_β ( n^{-1}‖Y − Xβ‖² + λ‖β‖_0 ), e.g. AIC, BIC
  ❀ computationally infeasible if p is large (2^p sub-models)
◮ ‖β^0‖_1 = ∑_{j=1}^p |β^0_j|
  ❀ penalize with the ‖·‖_1-norm, i.e. Lasso: argmin_β ( n^{-1}‖Y − Xβ‖² + λ‖β‖_1 )
  ❀ convex optimization: computationally feasible and very fast for large p


The Lasso (Tibshirani, 1996)

Lasso for linear models:
β̂(λ) = argmin_β ( n^{-1}‖Y − Xβ‖² + λ‖β‖_1 ),  λ ≥ 0,  ‖β‖_1 = ∑_{j=1}^p |β_j|

❀ convex optimization problem
◮ Lasso does variable selection: some of the β̂_j(λ) = 0 (because of the "ℓ_1-geometry")
◮ β̂(λ) is a shrunken LS-estimate
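As a hedged illustration on simulated data (not the riboflavin data), the Lasso can be computed with the glmnet package in R, which reappears later in connection with coordinate descent; note the exact zeros among the estimated coefficients (glmnet uses the objective (2n)^{-1}‖Y − Xβ‖² + λ‖β‖_1, so its λ differs by a factor 2 from the normalization on the slide):

library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), nrow = n)
beta0 <- c(3, -2, 1.5, rep(0, p - 3))
Y <- as.vector(X %*% beta0 + rnorm(n))
fit  <- glmnet(X, Y, alpha = 1)               # alpha = 1: Lasso penalty
bhat <- as.numeric(coef(fit, s = 0.5))        # coefficients at a fixed lambda = 0.5
sum(bhat[-1] != 0)                            # number of selected variables (intercept excluded)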


more about the "ℓ_1-geometry"

equivalence to the primal problem:
β̂_primal(R) = argmin_{β; ‖β‖_1 ≤ R} ‖Y − Xβ‖²_2 / n,
with a correspondence between λ and R which depends on the data (X_1, Y_1), ..., (X_n, Y_n)

[such an equivalence holds since
◮ ‖Y − Xβ‖²_2 / n is convex in β
◮ the constraint ‖β‖_1 ≤ R is convex
see e.g. Bertsekas (1995)]


p = 2
left: the ℓ_1-"world"
the residual sum of squares reaches a minimal value (for certain constellations of the data) if its contour lines hit the ℓ_1-ball in a corner
❀ β̂_1 = 0

[Figure: contour lines of the residual sum of squares and the ℓ_1-ball for p = 2]


the ℓ_2-"world" is different: Ridge regression

β̂_Ridge(λ) = argmin_β ( ‖Y − Xβ‖²_2 / n + λ‖β‖²_2 ),

with equivalent primal solution
β̂_Ridge;primal(R) = argmin_{β; ‖β‖_2 ≤ R} ‖Y − Xβ‖²_2 / n,
with a one-to-one correspondence between λ and R
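For contrast, a brief sketch (again on simulated data) of Ridge regression via glmnet with alpha = 0; unlike the Lasso, the ℓ_2 penalty shrinks coefficients but does not set them exactly to zero:

library(glmnet)
set.seed(1)
n <- 50; p <- 200
X <- matrix(rnorm(n * p), nrow = n)
Y <- as.vector(X[, 1:3] %*% c(3, -2, 1.5) + rnorm(n))
ridge <- glmnet(X, Y, alpha = 0)                             # alpha = 0: Ridge penalty
lasso <- glmnet(X, Y, alpha = 1)                             # alpha = 1: Lasso penalty
c(ridge_zeros = sum(as.numeric(coef(ridge, s = 0.5))[-1] == 0),   # typically 0 exact zeros
  lasso_zeros = sum(as.numeric(coef(lasso, s = 0.5))[-1] == 0))   # many exact zeros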


A note on the Bayesian approach

model:
β_1, ..., β_p i.i.d. ∼ p(β)dβ,
given β: Y ∼ N_n(Xβ, σ² I_n) with density f(y | β, σ²)

posterior density:
p(β | Y, σ²) = f(Y | β, σ²) p(β) / ∫ f(Y | β, σ²) p(β) dβ ∝ f(Y | β, σ²) p(β)

and hence for the MAP (Maximum A-Posteriori) estimator:
β̂_MAP = argmax_β p(β | Y, σ²) = argmin_β ( − log f(Y | β, σ²) − log p(β) )
      = argmin_β ( (1/(2σ²)) ‖Y − Xβ‖²_2 − ∑_{j=1}^p log p(β_j) )


examples:
1. Double-Exponential prior DExp(τ):
   p(β) = (τ/2) exp(−τ|β|)
   ❀ β̂_MAP equals the Lasso with penalty parameter λ = n^{-1} 2σ²τ
2. Gaussian prior N(0, τ²):
   p(β) = (1/(√(2π) τ)) exp(−β²/(2τ²))
   ❀ β̂_MAP equals the Ridge estimator with penalty parameter λ = n^{-1} σ²/τ²

but we will argue that the Lasso is also good if the truth is sparse with respect to ‖β^0‖_0, e.g. if the prior is (much) more spiky around zero than the Double-Exponential distribution
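A quick check of the first correspondence (a sketch using only the MAP criterion above and −log p(β_j) = τ|β_j| + const for the DExp(τ) prior; multiply the criterion by 2σ²/n):

\hat\beta_{\mathrm{MAP}}
  = \operatorname{argmin}_\beta\Big(\tfrac{1}{2\sigma^2}\|Y-X\beta\|_2^2 + \tau\|\beta\|_1\Big)
  = \operatorname{argmin}_\beta\Big(n^{-1}\|Y-X\beta\|_2^2 + \underbrace{n^{-1}2\sigma^2\tau}_{=\lambda}\,\|\beta\|_1\Big),

i.e. exactly the Lasso with λ = n^{-1} 2σ²τ; the Gaussian-prior calculation for Ridge is analogous and gives λ = n^{-1} σ²/τ².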


Orthonormal design

Y = Xβ + ε,  n^{-1} X^T X = I

Lasso = soft-thresholding estimator:
β̂_j(λ) = sign(Z_j)(|Z_j| − λ/2)_+ ,  Z_j = (n^{-1} X^T Y)_j (= OLS),
i.e. β̂_j(λ) = g_soft(Z_j)
[this is Exercise 2.1]

[Figure: threshold functions — soft-thresholding, hard-thresholding, Adaptive Lasso]
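A minimal R sketch of the soft-thresholding formula, assuming an exactly orthonormal design (constructed here via a scaled QR decomposition; sizes are hypothetical):

soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda / 2, 0)

set.seed(1)
n <- 100; p <- 20
X <- sqrt(n) * qr.Q(qr(matrix(rnorm(n * p), nrow = n)))   # columns satisfy t(X) %*% X / n = I
beta0 <- c(3, -2, rep(0, p - 2))
Y <- as.vector(X %*% beta0 + rnorm(n))
Z <- drop(crossprod(X, Y) / n)                            # OLS under orthonormal design
beta_hat <- soft_threshold(Z, lambda = 0.5)               # componentwise Lasso solution
round(beta_hat[1:5], 3)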


Prediction

goal: predict a new observation Y_new

consider the expected (w.r.t. new data; and random X) squared error loss:
E_{X_new, Y_new}[(Y_new − X_new β̂)²] = σ² + E_{X_new}[(X_new(β^0 − β̂))²]
                                      = σ² + (β̂ − β^0)^T Σ (β̂ − β^0),  Σ = Cov(X)

❀ terminology "prediction error":
for random design X: (β̂ − β^0)^T Σ (β̂ − β^0) = E_{X_new}[(X_new(β̂ − β^0))²]
for fixed design X:  (β̂ − β^0)^T Σ̂ (β̂ − β^0) = ‖X(β̂ − β^0)‖²_2 / n


Binary lymph node classification using gene expressions: a high noise problem

n = 49 samples, p = 7130 gene expressions

despite that it is classification: P[Y = 1 | X = x] = E[Y | X = x]
❀ estimate p̂(x) via a linear model; can then do classification

cross-validated misclassification error (2/3 training; 1/3 test):

Lasso    L2Boosting    FPLR      Pelora    1-NN      DLDA      SVM
21.1%    17.7%         35.25%    27.8%     43.25%    36.12%    36.88%

(the Lasso uses no additional variable selection; some of the competing methods use preliminary variable selection: best 200 genes, Wilcoxon test)

the Lasso selected on CV-average 13.12 out of p = 7130 genes

from a practical perspective:
if you trust in cross-validation, we can validate how good we are,
i.e. prediction may be a black box, but we can evaluate it!
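The cross-validation scheme above (2/3 training, 1/3 test) is easy to reproduce in spirit; a hedged R sketch on simulated binary data (the real gene-expression data are not reproduced here), fitting the linear model to Y ∈ {0, 1} and classifying by thresholding p̂(x) at 1/2:

library(glmnet)
set.seed(1)
n <- 60; p <- 1000
X <- matrix(rnorm(n * p), nrow = n)
prob <- plogis(X[, 1] - X[, 2])                      # only 2 informative variables (assumed)
Y <- rbinom(n, 1, prob)

miscl <- replicate(20, {                             # repeated 2/3 training / 1/3 test splits
  train <- sample(n, size = round(2 * n / 3))
  cvfit <- cv.glmnet(X[train, ], Y[train])           # Lasso for the linear (probability) model
  phat  <- predict(cvfit, newx = X[-train, ], s = "lambda.min")
  mean((phat > 0.5) != Y[-train])                    # misclassification on the test third
})
mean(miscl)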




and in fact, we will hear that:
◮ the Lasso is consistent for prediction assuming "essentially nothing"
◮ the Lasso is optimal for prediction assuming the "compatibility condition" for X


Estimation of regression coefficients

Y = Xβ^0 + ε,  p ≫ n,  with fixed (deterministic) design X

problem of identifiability:
for p > n: Xβ^0 = Xθ for any θ = β^0 + ξ with ξ in the null-space of X
❀ cannot say anything about ‖β̂ − β^0‖ without further assumptions!
❀ we will work with the compatibility assumption (see later by Sara)

and Sara will explain: under the compatibility condition
‖β̂ − β^0‖_1 ≤ C (s_0/φ_0²) √(log(p)/n),  s_0 = |supp(β^0)| = |{j; β^0_j ≠ 0}|


Variable selection

Example: Motif regression
for finding HIF1α transcription factor binding sites in DNA sequences
(Müller, Meier, PB & Ricci)

Y_i ∈ R: univariate response measuring binding intensity of HIF1α on coarse DNA segment i (from CHIP-chip experiments)

X_i = (X^(1)_i, ..., X^(p)_i) ∈ R^p:
X^(j)_i = abundance score of candidate motif j in DNA segment i
(using sequence data and computational biology algorithms, e.g. MDSCAN)


question: what is the relation between the binding intensity Y and the abundance of short candidate motifs?
❀ a linear model is often reasonable
"motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

Y = Xβ + ε,  n = 287,  p = 195

goal: variable selection
❀ find the relevant motifs among the p = 195 candidates


Lasso for variable selection:
Ŝ(λ) = {j; β̂_j(λ) ≠ 0} as an estimator of S_0 = {j; β^0_j ≠ 0}

no significance testing involved, it's convex optimization only!
(and that can be a problem... see later)


Motif regression
for finding HIF1α transcription factor binding sites in DNA sequences

Y_i ∈ R: univariate response measuring binding intensity on coarse DNA segment i (from CHIP-chip experiments)
X^(j)_i = abundance score of candidate motif j in DNA segment i

variable selection in the linear model
Y_i = β_0 + ∑_{j=1}^p β_j X^(j)_i + ε_i,  i = 1, ..., n = 287,  p = 195

❀ the Lasso selects 26 covariates and R² ≈ 50%, i.e. 26 interesting candidate motifs


[Figure: motif regression, estimated coefficients β̂(λ̂_CV) on the original data; coefficients plotted against the variable index]


"Theory" for variable selection with the Lasso

for the (fixed design) linear model Y = Xβ^0 + ε with active set S_0 = {j; β^0_j ≠ 0}

two key assumptions:
1. neighborhood stability condition for the design X
   ⇔ irrepresentable condition for the design X
2. beta-min condition:
   min_{j∈S_0} |β^0_j| ≥ C √(s_0 log(p)/n),  C suitably large

both conditions are sufficient and "essentially" necessary for
Ŝ(λ) = S_0 with high probability, using λ ≫ √(log(p)/n) (larger than for prediction)

already proved in Meinshausen & PB, 2004 (publ.: 2006)
and both assumptions are restrictive!




neighborhood stability condition ⇔ irrepresentable condition (Zhao & Yu, 2006)

n^{-1} X^T X = Σ̂
active set S_0 = {j; β^0_j ≠ 0} = {1, ..., s_0} consists of the first s_0 variables; partition

Σ̂ = ( Σ̂_{S_0,S_0}    Σ̂_{S_0,S_0^c}
      Σ̂_{S_0^c,S_0}  Σ̂_{S_0^c,S_0^c} )

irrepresentable condition:
‖ Σ̂_{S_0^c,S_0} Σ̂_{S_0,S_0}^{-1} sign(β^0_1, ..., β^0_{s_0})^T ‖_∞ < 1
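The irrepresentable condition is straightforward to evaluate numerically for a given design; a sketch in R (the function name check_irrepresentable and the toy design are hypothetical, not from the slides):

## evaluate || Sigma_{S^c,S} Sigma_{S,S}^{-1} sign(beta0_S) ||_inf ; the condition holds if this is < 1
check_irrepresentable <- function(X, S0, sign_beta_S0) {
  Sigma <- crossprod(X) / nrow(X)                       # hat Sigma = X^T X / n
  Sc <- setdiff(seq_len(ncol(X)), S0)
  v <- Sigma[Sc, S0, drop = FALSE] %*% solve(Sigma[S0, S0]) %*% sign_beta_S0
  max(abs(v))
}

set.seed(1)
n <- 200; p <- 10; s0 <- 3
X <- matrix(rnorm(n * p), nrow = n)                     # nearly uncorrelated columns
check_irrepresentable(X, S0 = 1:s0, sign_beta_S0 = rep(1, s0))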


not very realistic assumptions... what can we expect?

recall: under the compatibility condition
‖β̂ − β^0‖_1 ≤ C (s_0/φ_0²) √(log(p)/n)

consider the relevant active variables
S_relev = {j; |β^0_j| > C (s_0/φ_0²) √(log(p)/n)}

then, clearly,
Ŝ ⊇ S_relev with high probability

screening for detecting the relevant variables is possible!
without a beta-min condition and assuming the compatibility condition only


in addition: assuming the beta-min condition
min_{j∈S_0} |β^0_j| > C (s_0/φ_0²) √(log(p)/n)
we obtain
Ŝ ⊇ S_0 with high probability,
i.e. screening for detecting the true variables


Tibshirani (1996):
LASSO = Least Absolute Shrinkage and Selection Operator

new translation:
LASSO = Least Absolute Shrinkage and Screening Operator


Practical perspective

choice of λ: λ̂_CV from cross-validation

empirical and theoretical indications (Meinshausen & PB, 2006) that
Ŝ(λ̂_CV) ⊇ S_0 (or S_relev)

moreover,
|Ŝ(λ̂_CV)| ≤ min(n, p)  (= n if p ≫ n)

❀ huge dimensionality reduction (in the original covariates)
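Choosing λ̂_CV in practice is a one-liner with cv.glmnet; a sketch on simulated data (sizes assumed) illustrating that |Ŝ(λ̂_CV)| stays below min(n, p):

library(glmnet)
set.seed(1)
n <- 80; p <- 2000
X <- matrix(rnorm(n * p), nrow = n)
beta0 <- c(rep(1.5, 5), rep(0, p - 5))
Y <- as.vector(X %*% beta0 + rnorm(n))

cvfit <- cv.glmnet(X, Y)                                    # 10-fold CV over a lambda grid
lambda_cv <- cvfit$lambda.min                               # hat lambda_CV
S_hat <- which(as.numeric(coef(cvfit, s = "lambda.min"))[-1] != 0)
length(S_hat)                                               # at most min(n, p) selected variables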


[Figure: motif regression, estimated coefficients β̂(λ̂_CV) on the original data]

which variables in Ŝ are false positives?
(p-values would be very useful!)


recall:
Ŝ(λ̂_CV) ⊇ S_0 (or S_relev)

and we would then use a second stage to reduce the number of false positive selections
❀ re-estimation on a much smaller model with the variables from Ŝ:
◮ OLS on Ŝ with e.g. BIC variable selection (sketched below)
◮ thresholding coefficients and OLS re-estimation
◮ adaptive Lasso (Zou, 2006)
◮ ...
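A hedged sketch of the first option on simulated data: refit the Lasso-screened variables by OLS and then apply a BIC-type backward selection (base R's step with k = log(n)); data and sizes are assumed, not from the slides:

library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), nrow = n); colnames(X) <- paste0("x", 1:p)
Y <- as.vector(X[, 1:3] %*% c(2, -2, 1.5) + rnorm(n))

cvfit <- cv.glmnet(X, Y)
S_hat <- which(as.numeric(coef(cvfit, s = "lambda.min"))[-1] != 0)   # Lasso screening
dat <- data.frame(Y = Y, X[, S_hat, drop = FALSE])
ols     <- lm(Y ~ ., data = dat)                      # OLS on the screened model
ols_bic <- step(ols, k = log(n), trace = 0)           # backward selection with BIC penalty
names(coef(ols_bic))                                  # variables surviving the second stage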




Adaptive Lasso (Zou, 2006)

re-weighting the penalty function:
β̂ = argmin_β ( ‖Y − Xβ‖²_2/n + λ ∑_{j=1}^p |β_j| / |β̂_init,j| ),
with β̂_init,j from the Lasso in a first stage (or OLS if p < n)

for orthogonal design, if β̂_init = OLS:
Adaptive Lasso = NN-garrote (Zou, 2006)
❀ less bias than the Lasso

[Figure: threshold functions — Adaptive Lasso, hard-thresholding, soft-thresholding]
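A sketch of the adaptive Lasso using glmnet's penalty.factor argument for the weights 1/|β̂_init,j|; one common practical variant (used here, with simulated data) restricts the second stage to the variables selected in the first stage, which avoids infinite weights:

library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), nrow = n)
Y <- as.vector(X[, 1:4] %*% c(3, -2, 1.5, 1) + rnorm(n))

init  <- cv.glmnet(X, Y)                                   # first-stage Lasso
b_ini <- as.numeric(coef(init, s = "lambda.min"))[-1]
sel   <- which(b_ini != 0)                                 # first-stage selected variables
w     <- 1 / abs(b_ini[sel])                               # adaptive weights 1/|beta_init_j|
adapt <- cv.glmnet(X[, sel, drop = FALSE], Y, penalty.factor = w)   # re-weighted l1 penalty
c(lasso    = length(sel),
  adaptive = sum(as.numeric(coef(adapt, s = "lambda.min"))[-1] != 0))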


motif regression

[Figure: estimated coefficients of the Lasso (left) and the Adaptive Lasso (right), plotted against the variable index]

the Lasso selects 26 variables, the Adaptive Lasso selects 16 variables


KKT conditions and Computation

characterization of the solution(s) β̂ as minimizer(s) of the criterion function
Q_λ(β) = ‖Y − Xβ‖²_2/n + λ‖β‖_1

since Q_λ(·) is a convex function:
it is necessary and sufficient that the subdifferential of Q_λ(·) at β̂ contains the zero element

Lemma 2.1, first part (in the book)
denote by G(β) = −2X^T(Y − Xβ)/n the gradient vector of ‖Y − Xβ‖²_2/n.
Then: β̂ is a solution if and only if
G_j(β̂) = −sign(β̂_j)λ  if β̂_j ≠ 0,
|G_j(β̂)| ≤ λ           if β̂_j = 0
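The two conditions can be verified numerically; a sketch that checks them (up to a small tolerance) for the exact solution available under orthonormal design, using G(β) = −2X^T(Y − Xβ)/n (the helper kkt_ok is hypothetical):

kkt_ok <- function(beta_hat, X, Y, lambda, tol = 1e-8) {
  G <- as.vector(-2 * crossprod(X, Y - X %*% beta_hat) / nrow(X))  # gradient of ||Y - Xb||^2 / n
  active <- beta_hat != 0
  all(abs(G[active] + sign(beta_hat[active]) * lambda) < tol) &&   # G_j = -sign(beta_j) * lambda
    all(abs(G[!active]) <= lambda + tol)                           # |G_j| <= lambda
}

set.seed(1)
n <- 100; p <- 20; lambda <- 0.5
X <- sqrt(n) * qr.Q(qr(matrix(rnorm(n * p), nrow = n)))            # orthonormal design
Y <- as.vector(X %*% c(3, -2, rep(0, p - 2)) + rnorm(n))
Z <- drop(crossprod(X, Y) / n)
beta_hat <- sign(Z) * pmax(abs(Z) - lambda / 2, 0)                 # exact Lasso solution here
kkt_ok(beta_hat, X, Y, lambda)                                     # should return TRUE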


Lemma 2.1, second part (in the book)
If the solution of argmin_β Q_λ(β) is not unique (e.g. if p > n), and if |G_j(β̂)| < λ for some solution β̂, then β̂_j = 0 for all (other) solutions β̂ in argmin_β Q_λ(β).

The zeroes are "essentially" unique
("essentially" refers to the situation: β̂_j = 0 and |G_j(β̂)| = λ)

Proof: Exercise (optional), or see the book


Coordinate descent algorithm for computation

general idea: compute a solution β̂(λ_grid,k) and use it as a starting value (warm start) for the computation of β̂(λ_grid,k−1) at the next, smaller grid value λ_grid,k−1 < λ_grid,k


for squared error loss: explicit up-dating formulae (Exercise 2.7)

G_j(β) = −2X_j^T(Y − Xβ)/n

β^(m)_j = sign(Z_j)(|Z_j| − λ/2)_+ / Σ̂_jj,  Z_j = X_j^T(Y − Xβ_{−j})/n,  Σ̂ = n^{-1}X^T X

❀ componentwise soft-thresholding

this is very fast if the true problem is sparse
active set strategy: can do non-systematic cycling, visiting mainly the active (non-zero) components

riboflavin example, n = 71, p = 4088:
0.33 secs. CPU using the glmnet-package in R (Friedman, Hastie & Tibshirani, 2008)
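A compact R sketch of the componentwise soft-thresholding update above: a plain cyclic coordinate descent for the Lasso, without the active-set and warm-start refinements of glmnet (data and stopping rule are assumed for illustration):

lasso_cd <- function(X, Y, lambda, n_iter = 200) {
  n <- nrow(X); p <- ncol(X)
  Sigma_diag <- colSums(X^2) / n                   # hat Sigma_jj
  beta <- rep(0, p)
  r <- Y                                           # residual Y - X beta (beta = 0 initially)
  for (it in 1:n_iter) {
    for (j in 1:p) {
      Z_j <- sum(X[, j] * r) / n + Sigma_diag[j] * beta[j]        # X_j^T (Y - X beta_{-j}) / n
      b_new <- sign(Z_j) * pmax(abs(Z_j) - lambda / 2, 0) / Sigma_diag[j]
      r <- r - X[, j] * (b_new - beta[j])          # update residual after the coordinate move
      beta[j] <- b_new
    }
  }
  beta
}

set.seed(1)
n <- 100; p <- 300
X <- matrix(rnorm(n * p), nrow = n)
Y <- as.vector(X[, 1:3] %*% c(3, -2, 1.5) + rnorm(n))
beta_hat <- lasso_cd(X, Y, lambda = 0.5)
which(beta_hat != 0)                               # indices of the selected variables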


the coordinate descent algorithm converges to a stationary point (Paul Tseng, ≈ 2000)
❀ convergence to a global optimum, due to convexity of the problem

main assumption:
objective function = smooth function + separable penalty
here: "separable" means "additive", i.e., pen(β) = ∑_{j=1}^p p_j(β_j)


failure of the coordinate descent algorithm: Fused Lasso

β̂ = argmin_β ( ‖Y − Xβ‖²_2/n + λ_1 ∑_{j=2}^p |β_j − β_{j−1}| + λ_2 ‖β‖_1 )

but ∑_{j=2}^p |β_j − β_{j−1}| is non-separable

[Figure: contour lines of the penalties |β_1 − β_2| and |β_1| + |β_2| for p = 2]
