Statistical inference for exploratory data analysis ... - Hadley Wickham
Statistical inference for exploratory data analysis ... - Hadley Wickham Statistical inference for exploratory data analysis ... - Hadley Wickham
count 40 30 20 10 0 40 30 20 10 0 40 30 20 10 0 40 30 20 10 0 40 30 20 10 0 1 5 9 13 17 2 4 6 8 10 Statistical inference for graphics 4373 2 6 10 14 18 2 4 6 8 10 tips 3 7 11 15 19 2 4 6 8 10 4 8 12 16 20 2 4 6 8 10 Figure 4. Simulation lineup: Tips Data, histograms of tip. Which histogram is most different? Explain what makes it different from the others. Figure 6 shows residuals plotted against the order in the data. Structure in the real plot would indicate the presence of lurking variables. The plot of the real data is embedded among decoys, which were produced by simulating from the residual rotation distribution, as discussed in §3 and the electronic supplementary material. Which is the plot of the real data? Why? What may be ‘lurking’? (v) Leukaemia data: This dataset originated from a study of gene expression in two types of acute leukaemia (Golub et al. 1999), acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML). The dataset consists of Phil. Trans. R. Soc. A (2009) Downloaded from rsta.royalsocietypublishing.org on January 7, 2010
4374 A. Buja et al. (a) (b) HSBC returns 0.02 0.01 0 −0.01 −0.02 0.02 0.01 0 −0.01 −0.02 0.02 0.01 0 −0.01 −0.02 0.02 0.01 0 −0.01 −0.02 0.02 0.01 0 −0.01 −0.02 0.02 0.01 0 −0.01 −0.02 0.02 0.01 0 −0.01 −0.02 0.02 0.01 0 −0.01 −0.02 1 2 3 4 5 6 7 8 Jan−2005 Jul−2005 time Jan−2006 0.10 0.05 0 −0.05 −0.10 0.10 0.05 0 −0.05 −0.10 0.10 0.05 0 −0.05 −0.10 0.10 0.05 0 −0.05 −0.10 0.10 0.05 0 −0.05 −0.10 0.10 0.05 0 −0.05 −0.10 0.10 0.05 0 −0.05 −0.10 0.10 0.05 0 −0.05 −0.10 1998 2000 2002 2004 2006 time Figure 5. Permutation lineup: (a) HSBC daily returns 2005, time series of 259 trading days and (b) HSBC daily returns 1998–2005, time series of 2087 trading days. The plot of the real data is embedded among seven permutation null plots. Which plot is different from the others and why? 25 cases of AML and 47 cases of ALL (38 cases of B-cell ALL and 9 cases of T-cell ALL), giving 72 cases. After pre-processing, there are 3571 human gene expression variables. To explore class structure, one may seek a few interesting low-dimensional projections that reveal class separations using a projection pursuit method, such as LDA (linear discriminant analysis) and PDA (penalized discriminant analysis) indices (Lee 2003; Lee et al. 2005). Phil. Trans. R. Soc. A (2009) Downloaded from rsta.royalsocietypublishing.org on January 7, 2010 1 2 3 4 5 6 7 8
- Page 1 and 2: Statistical inference for explorato
- Page 3 and 4: 4362 A. Buja et al. 1. Introduction
- Page 5 and 6: 4364 A. Buja et al. multiple quanti
- Page 7 and 8: 4366 A. Buja et al. such list of fe
- Page 9 and 10: 4368 A. Buja et al. Of these approa
- Page 11 and 12: 4370 A. Buja et al. one feels diffe
- Page 13: 4372 A. Buja et al. Housing 20 000
- Page 17 and 18: 4376 A. Buja et al. 5 0 5 0 −5 5
- Page 19 and 20: 4378 A. Buja et al. 3.0 2.5 2.0 1.5
- Page 21 and 22: 4380 A. Buja et al. or detrimental
- Page 23 and 24: 4382 A. Buja et al. Belsley, D. A.,
count<br />
40<br />
30<br />
20<br />
10<br />
0<br />
40<br />
30<br />
20<br />
10<br />
0<br />
40<br />
30<br />
20<br />
10<br />
0<br />
40<br />
30<br />
20<br />
10<br />
0<br />
40<br />
30<br />
20<br />
10<br />
0<br />
1<br />
5<br />
9<br />
13<br />
17<br />
2 4 6 8 10<br />
<strong>Statistical</strong> <strong>inference</strong> <strong>for</strong> graphics 4373<br />
2<br />
6<br />
10<br />
14<br />
18<br />
2 4 6 8 10<br />
tips<br />
3<br />
7<br />
11<br />
15<br />
19<br />
2 4 6 8 10<br />
4<br />
8<br />
12<br />
16<br />
20<br />
2 4 6 8 10<br />
Figure 4. Simulation lineup: Tips Data, histograms of tip. Which histogram is most different?<br />
Explain what makes it different from the others.<br />
Figure 6 shows residuals plotted against the order in the <strong>data</strong>. Structure in the<br />
real plot would indicate the presence of lurking variables. The plot of the real<br />
<strong>data</strong> is embedded among decoys, which were produced by simulating from the<br />
residual rotation distribution, as discussed in §3 and the electronic supplementary<br />
material. Which is the plot of the real <strong>data</strong>? Why? What may be ‘lurking’?<br />
(v) Leukaemia <strong>data</strong>: This <strong>data</strong>set originated from a study of gene expression<br />
in two types of acute leukaemia (Golub et al. 1999), acute lymphoblastic<br />
leukaemia (ALL) and acute myeloid leukaemia (AML). The <strong>data</strong>set consists of<br />
Phil. Trans. R. Soc. A (2009)<br />
Downloaded from<br />
rsta.royalsocietypublishing.org on January 7, 2010