
Statistics for high-dimensional data: Introduction, and the Lasso for linear models

Peter Bühlmann and Sara van de Geer
Seminar für Statistik, ETH Zürich
May 2012


High-dimensional data

Riboflavin production with Bacillus Subtilis (in collaboration with DSM (Switzerland))

goal: improve riboflavin production rate of Bacillus Subtilis using clever genetic engineering

response variable Y ∈ R: riboflavin (log-) production rate
covariates X ∈ R^p: expressions from p = 4088 genes
sample size n = 115, p ≫ n

[Figure: gene expression data — scatterplots of Y versus 9 "reasonable" genes]


general framework:
Z_1, ..., Z_n (with some "i.i.d. components"), dim(Z_i) ≫ n

for example:
◮ Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ Y ⊆ R: regression with p ≫ n
◮ Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ {0, 1}: classification for p ≫ n

numerous applications: biology, imaging, economy, environmental sciences, ...


High-dimensional linear models

Y_i = ∑_{j=1}^p β^0_j X^(j)_i + ε_i,  i = 1, ..., n,  with p ≫ n
in short: Y = Xβ^0 + ε (see the simulation sketch below)

goals:
◮ prediction, e.g. w.r.t. squared prediction error
◮ estimation of β^0, e.g. w.r.t. ‖β̂ − β^0‖_q (q = 1, 2)
◮ variable selection, i.e. estimating the active set of the effective variables (those with coefficient ≠ 0)
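This setup is easy to mimic on a computer. A minimal R sketch (the sizes n = 100, p = 1000, s_0 = 5 are hypothetical choices, not from the slides) generates data from Y = Xβ^0 + ε with a sparse β^0:

set.seed(1)
n <- 100; p <- 1000; s0 <- 5            # p >> n, only s0 active variables
X <- matrix(rnorm(n * p), nrow = n)     # design matrix
beta0 <- c(rep(2, s0), rep(0, p - s0))  # sparse true coefficient vector
Y <- as.vector(X %*% beta0 + rnorm(n))  # response: Y = X beta0 + noise
str(list(n = n, p = p, s0 = s0))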


we need to regularize... and there are many proposals:
◮ Bayesian methods for regularization
◮ greedy algorithms: aka forward selection or boosting
◮ preliminary dimension reduction
◮ ...

e.g. 3'190'000 entries on Google Scholar for "high dimensional linear model" ...




Penalty-based methods

if the true β^0 is sparse w.r.t.
◮ ‖β^0‖_0 = number of non-zero coefficients
  ❀ regularize with the ‖·‖_0-penalty: argmin_β ( n^{-1}‖Y − Xβ‖² + λ‖β‖_0 ), e.g. AIC, BIC
  ❀ computationally infeasible if p is large (2^p sub-models)
◮ ‖β^0‖_1 = ∑_{j=1}^p |β^0_j|
  ❀ penalize with the ‖·‖_1-norm, i.e. Lasso: argmin_β ( n^{-1}‖Y − Xβ‖² + λ‖β‖_1 )
  ❀ convex optimization: computationally feasible and very fast for large p


The Lasso (Tibshirani, 1996)

Lasso for linear models:
β̂(λ) = argmin_β ( n^{-1}‖Y − Xβ‖² + λ‖β‖_1 ),  λ ≥ 0,  ‖β‖_1 = ∑_{j=1}^p |β_j|

❀ convex optimization problem
◮ Lasso does variable selection: some of the β̂_j(λ) = 0 (because of the "ℓ_1-geometry")
◮ β̂(λ) is a shrunken LS-estimate
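As a hedged illustration on simulated data (not the riboflavin data), the Lasso can be computed with the glmnet package in R, which reappears later in connection with coordinate descent; note the exact zeros among the estimated coefficients (glmnet uses the objective (2n)^{-1}‖Y − Xβ‖² + λ‖β‖_1, so its λ differs by a factor 2 from the normalization on the slide):

library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), nrow = n)
beta0 <- c(3, -2, 1.5, rep(0, p - 3))
Y <- as.vector(X %*% beta0 + rnorm(n))
fit  <- glmnet(X, Y, alpha = 1)               # alpha = 1: Lasso penalty
bhat <- as.numeric(coef(fit, s = 0.5))        # coefficients at a fixed lambda = 0.5
sum(bhat[-1] != 0)                            # number of selected variables (intercept excluded)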


more about the "ℓ_1-geometry"

equivalence to the primal problem:
β̂_primal(R) = argmin_{β; ‖β‖_1 ≤ R} ‖Y − Xβ‖²_2 / n,
with a correspondence between λ and R which depends on the data (X_1, Y_1), ..., (X_n, Y_n)

[such an equivalence holds since
◮ ‖Y − Xβ‖²_2 / n is convex in β
◮ the constraint ‖β‖_1 ≤ R is convex
see e.g. Bertsekas (1995)]


p = 2
left: the ℓ_1-"world"
the residual sum of squares reaches a minimal value (for certain constellations of the data) if its contour lines hit the ℓ_1-ball in a corner
❀ β̂_1 = 0

[Figure: contour lines of the residual sum of squares and the ℓ_1-ball for p = 2]


the ℓ_2-"world" is different: Ridge regression

β̂_Ridge(λ) = argmin_β ( ‖Y − Xβ‖²_2 / n + λ‖β‖²_2 ),

with equivalent primal solution
β̂_Ridge;primal(R) = argmin_{β; ‖β‖_2 ≤ R} ‖Y − Xβ‖²_2 / n,
with a one-to-one correspondence between λ and R
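For contrast, a brief sketch (again on simulated data) of Ridge regression via glmnet with alpha = 0; unlike the Lasso, the ℓ_2 penalty shrinks coefficients but does not set them exactly to zero:

library(glmnet)
set.seed(1)
n <- 50; p <- 200
X <- matrix(rnorm(n * p), nrow = n)
Y <- as.vector(X[, 1:3] %*% c(3, -2, 1.5) + rnorm(n))
ridge <- glmnet(X, Y, alpha = 0)                             # alpha = 0: Ridge penalty
lasso <- glmnet(X, Y, alpha = 1)                             # alpha = 1: Lasso penalty
c(ridge_zeros = sum(as.numeric(coef(ridge, s = 0.5))[-1] == 0),   # typically 0 exact zeros
  lasso_zeros = sum(as.numeric(coef(lasso, s = 0.5))[-1] == 0))   # many exact zeros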


A note on the Bayesian approach

model:
β_1, ..., β_p i.i.d. ∼ p(β)dβ,
given β: Y ∼ N_n(Xβ, σ² I_n) with density f(y | β, σ²)

posterior density:
p(β | Y, σ²) = f(Y | β, σ²) p(β) / ∫ f(Y | β, σ²) p(β) dβ ∝ f(Y | β, σ²) p(β)

and hence for the MAP (Maximum A-Posteriori) estimator:
β̂_MAP = argmax_β p(β | Y, σ²) = argmin_β ( − log f(Y | β, σ²) − log p(β) )
      = argmin_β ( (1/(2σ²)) ‖Y − Xβ‖²_2 − ∑_{j=1}^p log p(β_j) )


examples:
1. Double-Exponential prior DExp(τ):
   p(β) = (τ/2) exp(−τ|β|)
   ❀ β̂_MAP equals the Lasso with penalty parameter λ = n^{-1} 2σ²τ
2. Gaussian prior N(0, τ²):
   p(β) = (1/(√(2π) τ)) exp(−β²/(2τ²))
   ❀ β̂_MAP equals the Ridge estimator with penalty parameter λ = n^{-1} σ²/τ²

but we will argue that the Lasso is also good if the truth is sparse with respect to ‖β^0‖_0, e.g. if the prior is (much) more spiky around zero than the Double-Exponential distribution
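A quick check of the first correspondence (a sketch using only the MAP criterion above and −log p(β_j) = τ|β_j| + const for the DExp(τ) prior; multiply the criterion by 2σ²/n):

\hat\beta_{\mathrm{MAP}}
  = \operatorname{argmin}_\beta\Big(\tfrac{1}{2\sigma^2}\|Y-X\beta\|_2^2 + \tau\|\beta\|_1\Big)
  = \operatorname{argmin}_\beta\Big(n^{-1}\|Y-X\beta\|_2^2 + \underbrace{n^{-1}2\sigma^2\tau}_{=\lambda}\,\|\beta\|_1\Big),

i.e. exactly the Lasso with λ = n^{-1} 2σ²τ; the Gaussian-prior calculation for Ridge is analogous and gives λ = n^{-1} σ²/τ².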


Orthonormal design

Y = Xβ + ε,  n^{-1} X^T X = I

Lasso = soft-thresholding estimator:
β̂_j(λ) = sign(Z_j)(|Z_j| − λ/2)_+ ,  Z_j = (n^{-1} X^T Y)_j (= OLS),
i.e. β̂_j(λ) = g_soft(Z_j)
[this is Exercise 2.1]

[Figure: threshold functions — soft-thresholding, hard-thresholding, Adaptive Lasso]
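A minimal R sketch of the soft-thresholding formula, assuming an exactly orthonormal design (constructed here via a scaled QR decomposition; sizes are hypothetical):

soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda / 2, 0)

set.seed(1)
n <- 100; p <- 20
X <- sqrt(n) * qr.Q(qr(matrix(rnorm(n * p), nrow = n)))   # columns satisfy t(X) %*% X / n = I
beta0 <- c(3, -2, rep(0, p - 2))
Y <- as.vector(X %*% beta0 + rnorm(n))
Z <- drop(crossprod(X, Y) / n)                            # OLS under orthonormal design
beta_hat <- soft_threshold(Z, lambda = 0.5)               # componentwise Lasso solution
round(beta_hat[1:5], 3)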


Prediction

goal: predict a new observation Y_new

consider the expected (w.r.t. new data; and random X) squared error loss:
E_{X_new, Y_new}[(Y_new − X_new β̂)²] = σ² + E_{X_new}[(X_new(β^0 − β̂))²]
                                      = σ² + (β̂ − β^0)^T Σ (β̂ − β^0),  Σ = Cov(X)

❀ terminology "prediction error":
for random design X: (β̂ − β^0)^T Σ (β̂ − β^0) = E_{X_new}[(X_new(β̂ − β^0))²]
for fixed design X:  (β̂ − β^0)^T Σ̂ (β̂ − β^0) = ‖X(β̂ − β^0)‖²_2 / n


Binary lymph node classification using gene expressions: a high noise problem

n = 49 samples, p = 7130 gene expressions

despite that it is classification: P[Y = 1 | X = x] = E[Y | X = x]
❀ estimate p̂(x) via a linear model; can then do classification

cross-validated misclassification error (2/3 training; 1/3 test):

Lasso    L2Boosting    FPLR      Pelora    1-NN      DLDA      SVM
21.1%    17.7%         35.25%    27.8%     43.25%    36.12%    36.88%

(the Lasso uses no additional variable selection; some of the competing methods use preliminary variable selection: best 200 genes, Wilcoxon test)

the Lasso selected on CV-average 13.12 out of p = 7130 genes

from a practical perspective:
if you trust in cross-validation, we can validate how good we are,
i.e. prediction may be a black box, but we can evaluate it!
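The cross-validation scheme above (2/3 training, 1/3 test) is easy to reproduce in spirit; a hedged R sketch on simulated binary data (the real gene-expression data are not reproduced here), fitting the linear model to Y ∈ {0, 1} and classifying by thresholding p̂(x) at 1/2:

library(glmnet)
set.seed(1)
n <- 60; p <- 1000
X <- matrix(rnorm(n * p), nrow = n)
prob <- plogis(X[, 1] - X[, 2])                      # only 2 informative variables (assumed)
Y <- rbinom(n, 1, prob)

miscl <- replicate(20, {                             # repeated 2/3 training / 1/3 test splits
  train <- sample(n, size = round(2 * n / 3))
  cvfit <- cv.glmnet(X[train, ], Y[train])           # Lasso for the linear (probability) model
  phat  <- predict(cvfit, newx = X[-train, ], s = "lambda.min")
  mean((phat > 0.5) != Y[-train])                    # misclassification on the test third
})
mean(miscl)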




and in fact, we will hear that:
◮ the Lasso is consistent for prediction assuming "essentially nothing"
◮ the Lasso is optimal for prediction assuming the "compatibility condition" for X


Estimation of regression coefficients

Y = Xβ^0 + ε,  p ≫ n,  with fixed (deterministic) design X

problem of identifiability:
for p > n: Xβ^0 = Xθ for any θ = β^0 + ξ with ξ in the null-space of X
❀ cannot say anything about ‖β̂ − β^0‖ without further assumptions!
❀ we will work with the compatibility assumption (see later by Sara)

and Sara will explain: under the compatibility condition
‖β̂ − β^0‖_1 ≤ C (s_0/φ_0²) √(log(p)/n),  s_0 = |supp(β^0)| = |{j; β^0_j ≠ 0}|


Variable selection

Example: Motif regression
for finding HIF1α transcription factor binding sites in DNA sequences
(Müller, Meier, PB & Ricci)

Y_i ∈ R: univariate response measuring binding intensity of HIF1α on coarse DNA segment i (from CHIP-chip experiments)

X_i = (X^(1)_i, ..., X^(p)_i) ∈ R^p:
X^(j)_i = abundance score of candidate motif j in DNA segment i
(using sequence data and computational biology algorithms, e.g. MDSCAN)


question: what is the relation between the binding intensity Y and the abundance of short candidate motifs?
❀ a linear model is often reasonable
"motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

Y = Xβ + ε,  n = 287,  p = 195

goal: variable selection
❀ find the relevant motifs among the p = 195 candidates


Lasso for variable selection:
Ŝ(λ) = {j; β̂_j(λ) ≠ 0} as an estimator of S_0 = {j; β^0_j ≠ 0}

no significance testing involved, it's convex optimization only!
(and that can be a problem... see later)


Motif regression
for finding HIF1α transcription factor binding sites in DNA sequences

Y_i ∈ R: univariate response measuring binding intensity on coarse DNA segment i (from CHIP-chip experiments)
X^(j)_i = abundance score of candidate motif j in DNA segment i

variable selection in the linear model
Y_i = β_0 + ∑_{j=1}^p β_j X^(j)_i + ε_i,  i = 1, ..., n = 287,  p = 195

❀ the Lasso selects 26 covariates and R² ≈ 50%, i.e. 26 interesting candidate motifs


[Figure: motif regression, estimated coefficients β̂(λ̂_CV) on the original data; coefficients plotted against the variable index]


"Theory" for variable selection with the Lasso

for the (fixed design) linear model Y = Xβ^0 + ε with active set S_0 = {j; β^0_j ≠ 0}

two key assumptions:
1. neighborhood stability condition for the design X
   ⇔ irrepresentable condition for the design X
2. beta-min condition:
   min_{j∈S_0} |β^0_j| ≥ C √(s_0 log(p)/n),  C suitably large

both conditions are sufficient and "essentially" necessary for
Ŝ(λ) = S_0 with high probability, using λ ≫ √(log(p)/n) (larger than for prediction)

already proved in Meinshausen & PB, 2004 (publ.: 2006)
and both assumptions are restrictive!




neighborhood stability condition ⇔ irrepresentable condition (Zhao & Yu, 2006)

n^{-1} X^T X = Σ̂
active set S_0 = {j; β^0_j ≠ 0} = {1, ..., s_0} consists of the first s_0 variables; partition

Σ̂ = ( Σ̂_{S_0,S_0}    Σ̂_{S_0,S_0^c}
      Σ̂_{S_0^c,S_0}  Σ̂_{S_0^c,S_0^c} )

irrepresentable condition:
‖ Σ̂_{S_0^c,S_0} Σ̂_{S_0,S_0}^{-1} sign(β^0_1, ..., β^0_{s_0})^T ‖_∞ < 1
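The irrepresentable condition is straightforward to evaluate numerically for a given design; a sketch in R (the function name check_irrepresentable and the toy design are hypothetical, not from the slides):

## evaluate || Sigma_{S^c,S} Sigma_{S,S}^{-1} sign(beta0_S) ||_inf ; the condition holds if this is < 1
check_irrepresentable <- function(X, S0, sign_beta_S0) {
  Sigma <- crossprod(X) / nrow(X)                       # hat Sigma = X^T X / n
  Sc <- setdiff(seq_len(ncol(X)), S0)
  v <- Sigma[Sc, S0, drop = FALSE] %*% solve(Sigma[S0, S0]) %*% sign_beta_S0
  max(abs(v))
}

set.seed(1)
n <- 200; p <- 10; s0 <- 3
X <- matrix(rnorm(n * p), nrow = n)                     # nearly uncorrelated columns
check_irrepresentable(X, S0 = 1:s0, sign_beta_S0 = rep(1, s0))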


not very realistic assumptions... what can we expect?

recall: under the compatibility condition
‖β̂ − β^0‖_1 ≤ C (s_0/φ_0²) √(log(p)/n)

consider the relevant active variables
S_relev = {j; |β^0_j| > C (s_0/φ_0²) √(log(p)/n)}

then, clearly,
Ŝ ⊇ S_relev with high probability

screening for detecting the relevant variables is possible!
without a beta-min condition and assuming the compatibility condition only


in addition: assuming the beta-min condition
min_{j∈S_0} |β^0_j| > C (s_0/φ_0²) √(log(p)/n)
we obtain
Ŝ ⊇ S_0 with high probability,
i.e. screening for detecting the true variables


Tibshirani (1996):
LASSO = Least Absolute Shrinkage and Selection Operator

new translation:
LASSO = Least Absolute Shrinkage and Screening Operator


Practical perspective

choice of λ: λ̂_CV from cross-validation

empirical and theoretical indications (Meinshausen & PB, 2006) that
Ŝ(λ̂_CV) ⊇ S_0 (or S_relev)

moreover,
|Ŝ(λ̂_CV)| ≤ min(n, p)  (= n if p ≫ n)

❀ huge dimensionality reduction (in the original covariates)
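Choosing λ̂_CV in practice is a one-liner with cv.glmnet; a sketch on simulated data (sizes assumed) illustrating that |Ŝ(λ̂_CV)| stays below min(n, p):

library(glmnet)
set.seed(1)
n <- 80; p <- 2000
X <- matrix(rnorm(n * p), nrow = n)
beta0 <- c(rep(1.5, 5), rep(0, p - 5))
Y <- as.vector(X %*% beta0 + rnorm(n))

cvfit <- cv.glmnet(X, Y)                                    # 10-fold CV over a lambda grid
lambda_cv <- cvfit$lambda.min                               # hat lambda_CV
S_hat <- which(as.numeric(coef(cvfit, s = "lambda.min"))[-1] != 0)
length(S_hat)                                               # at most min(n, p) selected variables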


[Figure: motif regression, estimated coefficients β̂(λ̂_CV) on the original data]

which variables in Ŝ are false positives?
(p-values would be very useful!)


recall:
Ŝ(λ̂_CV) ⊇ S_0 (or S_relev)

and we would then use a second stage to reduce the number of false positive selections
❀ re-estimation on a much smaller model with the variables from Ŝ:
◮ OLS on Ŝ with e.g. BIC variable selection (sketched below)
◮ thresholding coefficients and OLS re-estimation
◮ adaptive Lasso (Zou, 2006)
◮ ...
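A hedged sketch of the first option on simulated data: refit the Lasso-screened variables by OLS and then apply a BIC-type backward selection (base R's step with k = log(n)); data and sizes are assumed, not from the slides:

library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), nrow = n); colnames(X) <- paste0("x", 1:p)
Y <- as.vector(X[, 1:3] %*% c(2, -2, 1.5) + rnorm(n))

cvfit <- cv.glmnet(X, Y)
S_hat <- which(as.numeric(coef(cvfit, s = "lambda.min"))[-1] != 0)   # Lasso screening
dat <- data.frame(Y = Y, X[, S_hat, drop = FALSE])
ols     <- lm(Y ~ ., data = dat)                      # OLS on the screened model
ols_bic <- step(ols, k = log(n), trace = 0)           # backward selection with BIC penalty
names(coef(ols_bic))                                  # variables surviving the second stage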




Adaptive Lasso (Zou, 2006)

re-weighting the penalty function:
β̂ = argmin_β ( ‖Y − Xβ‖²_2/n + λ ∑_{j=1}^p |β_j| / |β̂_init,j| ),
with β̂_init,j from the Lasso in a first stage (or OLS if p < n)

for orthogonal design, if β̂_init = OLS:
Adaptive Lasso = NN-garrote (Zou, 2006)
❀ less bias than the Lasso

[Figure: threshold functions — Adaptive Lasso, hard-thresholding, soft-thresholding]
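A sketch of the adaptive Lasso using glmnet's penalty.factor argument for the weights 1/|β̂_init,j|; one common practical variant (used here, with simulated data) restricts the second stage to the variables selected in the first stage, which avoids infinite weights:

library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), nrow = n)
Y <- as.vector(X[, 1:4] %*% c(3, -2, 1.5, 1) + rnorm(n))

init  <- cv.glmnet(X, Y)                                   # first-stage Lasso
b_ini <- as.numeric(coef(init, s = "lambda.min"))[-1]
sel   <- which(b_ini != 0)                                 # first-stage selected variables
w     <- 1 / abs(b_ini[sel])                               # adaptive weights 1/|beta_init_j|
adapt <- cv.glmnet(X[, sel, drop = FALSE], Y, penalty.factor = w)   # re-weighted l1 penalty
c(lasso    = length(sel),
  adaptive = sum(as.numeric(coef(adapt, s = "lambda.min"))[-1] != 0))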


motif regression

[Figure: estimated coefficients of the Lasso (left) and the Adaptive Lasso (right), plotted against the variable index]

the Lasso selects 26 variables, the Adaptive Lasso selects 16 variables


KKT conditions and Computation

characterization of the solution(s) β̂ as minimizer(s) of the criterion function
Q_λ(β) = ‖Y − Xβ‖²_2/n + λ‖β‖_1

since Q_λ(·) is a convex function:
it is necessary and sufficient that the subdifferential of Q_λ(·) at β̂ contains the zero element

Lemma 2.1, first part (in the book)
denote by G(β) = −2X^T(Y − Xβ)/n the gradient vector of ‖Y − Xβ‖²_2/n.
Then: β̂ is a solution if and only if
G_j(β̂) = −sign(β̂_j)λ  if β̂_j ≠ 0,
|G_j(β̂)| ≤ λ           if β̂_j = 0
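The two conditions can be verified numerically; a sketch that checks them (up to a small tolerance) for the exact solution available under orthonormal design, using G(β) = −2X^T(Y − Xβ)/n (the helper kkt_ok is hypothetical):

kkt_ok <- function(beta_hat, X, Y, lambda, tol = 1e-8) {
  G <- as.vector(-2 * crossprod(X, Y - X %*% beta_hat) / nrow(X))  # gradient of ||Y - Xb||^2 / n
  active <- beta_hat != 0
  all(abs(G[active] + sign(beta_hat[active]) * lambda) < tol) &&   # G_j = -sign(beta_j) * lambda
    all(abs(G[!active]) <= lambda + tol)                           # |G_j| <= lambda
}

set.seed(1)
n <- 100; p <- 20; lambda <- 0.5
X <- sqrt(n) * qr.Q(qr(matrix(rnorm(n * p), nrow = n)))            # orthonormal design
Y <- as.vector(X %*% c(3, -2, rep(0, p - 2)) + rnorm(n))
Z <- drop(crossprod(X, Y) / n)
beta_hat <- sign(Z) * pmax(abs(Z) - lambda / 2, 0)                 # exact Lasso solution here
kkt_ok(beta_hat, X, Y, lambda)                                     # should return TRUE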


Lemma 2.1, second part (in the book)
If the solution of argmin_β Q_λ(β) is not unique (e.g. if p > n), and if |G_j(β̂)| < λ for some solution β̂, then β̂_j = 0 for all (other) solutions β̂ in argmin_β Q_λ(β).

The zeroes are "essentially" unique
("essentially" refers to the situation: β̂_j = 0 and |G_j(β̂)| = λ)

Proof: Exercise (optional), or see the book


Coordinate descent algorithm for computation

general idea: compute a solution β̂(λ_grid,k) and use it as a starting value (warm start) for the computation of β̂(λ_grid,k−1) at the next, smaller grid value λ_grid,k−1 < λ_grid,k


for squared error loss: explicit up-dating formulae (Exercise 2.7)

G_j(β) = −2X_j^T(Y − Xβ)/n

β^(m)_j = sign(Z_j)(|Z_j| − λ/2)_+ / Σ̂_jj,  Z_j = X_j^T(Y − Xβ_{−j})/n,  Σ̂ = n^{-1}X^T X

❀ componentwise soft-thresholding

this is very fast if the true problem is sparse
active set strategy: can do non-systematic cycling, visiting mainly the active (non-zero) components

riboflavin example, n = 71, p = 4088:
0.33 secs. CPU using the glmnet-package in R (Friedman, Hastie & Tibshirani, 2008)
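A compact R sketch of the componentwise soft-thresholding update above: a plain cyclic coordinate descent for the Lasso, without the active-set and warm-start refinements of glmnet (data and stopping rule are assumed for illustration):

lasso_cd <- function(X, Y, lambda, n_iter = 200) {
  n <- nrow(X); p <- ncol(X)
  Sigma_diag <- colSums(X^2) / n                   # hat Sigma_jj
  beta <- rep(0, p)
  r <- Y                                           # residual Y - X beta (beta = 0 initially)
  for (it in 1:n_iter) {
    for (j in 1:p) {
      Z_j <- sum(X[, j] * r) / n + Sigma_diag[j] * beta[j]        # X_j^T (Y - X beta_{-j}) / n
      b_new <- sign(Z_j) * pmax(abs(Z_j) - lambda / 2, 0) / Sigma_diag[j]
      r <- r - X[, j] * (b_new - beta[j])          # update residual after the coordinate move
      beta[j] <- b_new
    }
  }
  beta
}

set.seed(1)
n <- 100; p <- 300
X <- matrix(rnorm(n * p), nrow = n)
Y <- as.vector(X[, 1:3] %*% c(3, -2, 1.5) + rnorm(n))
beta_hat <- lasso_cd(X, Y, lambda = 0.5)
which(beta_hat != 0)                               # indices of the selected variables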


the coordinate descent algorithm converges to a stationary point (Paul Tseng, ≈ 2000)
❀ convergence to a global optimum, due to convexity of the problem

main assumption:
objective function = smooth function + separable penalty
here: "separable" means "additive", i.e., pen(β) = ∑_{j=1}^p p_j(β_j)


failure of the coordinate descent algorithm: Fused Lasso

β̂ = argmin_β ( ‖Y − Xβ‖²_2/n + λ_1 ∑_{j=2}^p |β_j − β_{j−1}| + λ_2 ‖β‖_1 )

but ∑_{j=2}^p |β_j − β_{j−1}| is non-separable

[Figure: contour lines of the penalties |β_1 − β_2| and |β_1| + |β_2| for p = 2]
