Stat-403/Stat-650 : Intermediate Sampling and Experimental Design ...


Spreadsheets in Statistical Practice—Another Look

J. C. NASH

Many authors have criticized the use of spreadsheets for statistical data processing and computing because of incorrect statistical functions, no log file or audit trail, inconsistent behavior of computational dialogs, and poor handling of missing values. Some improvements in some spreadsheet processors and the possibility of audit trail facilities suggest that the use of a spreadsheet for some statistical data entry and simple analysis tasks may now be acceptable. A brief outline of some issues and some guidelines for good practice are included.

KEY WORDS: Audit trail; Data entry; Statistical computing.

J. C. Nash is Professor, School of Management, University of Ottawa, ON12K1N 9B5, Canada (E-mail: nashjc@uottawa.ca). This article would not have been written without the stimulation and interaction with Neil Smith, Andy Adler, Sylvie Noël, and Jody Goldberg. The author is involved with preparing test spreadsheets for the Gnumeric project.

© 2006 American Statistical Association. DOI: 10.1198/000313006X126585. The American Statistician, August 2006, Vol. 60, No. 3, p. 287.

1. CONCERNS ABOUT SPREADSHEETS

The ubiquity of spreadsheets has encouraged their use in statistics as well as most other areas of quantitative endeavour. Panko and Ordway (2005, also panko.cba.hawaii.edu/ssr/) showed that a vast majority of financial and management planning and decision-making uses spreadsheets, sometimes with disastrous consequences (Brethour 2003). The European Spreadsheet Risks Interest Group, which in fact has worldwide participation, considers these issues. See www.eusprig.org for many useful examples and links to their conference proceedings.

Many statisticians dislike spreadsheets in statistical practice, first because of bugs or inaccuracies in the mathematical or statistical functions of the spreadsheet programs. A sample of references includes Cryer (2002), Nash and Quon (1996), Nash, Quon, and Gianini (1995), and contributions by McCullough (1998, 1998) and McCullough and Wilson (2002, 2005).

A second concern is data entry and edit, where the lack of an audit trail of changes to the spreadsheet data is an invitation to poor and unverifiable work (Nash and Quon 1996). Yet spreadsheet use is almost casual, for example, by Mount et al. (2004): “Data were entered into Microsoft Access and Microsoft Excel and exported to Stata (version 7) for analysis.”

Practitioners are well-aware how easily errors and falsifications arise in data collection. An excellent and entertaining overview was given by Gentleman (2000). Popular statistical packages offer an audit or log file as an aid for checking work performed.

A third issue is that the use of “one tool for all tasks” may leave students unaware of the diversity of tools and unable to select the most appropriate software for their needs (Hunt 1995; College Entrance Exam Board 2002). Despite the pedagogical convenience of familiar software, statisticians have a role in promoting the use of tools appropriate to the task.

Most of us are likely, however, to use spreadsheets or spreadsheet-like interfaces, possibly in statistical packages such as Minitab, Statistica, UNISTAT, and NCSS. There are good reasons for this. Spreadsheets allow the user to access the data more or less randomly. That is, we can go to any cell and make a change. If cells contain formulas or functions, the spreadsheet computational paradigm is supposed to ensure that all dependent cells of the dataset are updated.

Updating is useful, but it is also dangerous, since we can do a lot of damage with clumsy fingers on the keyboard. Furthermore, as noted by Nash and Quon (1996), some of the statistical dialogs of spreadsheets, for example, regression, result in static outputs—a violation of the spreadsheet paradigm that results in errors when users do not re-run the calculations after updating their data. The confusion is worsened by different behavior depending on the calculation chosen and the spreadsheet processor. In Excel 2003, ANOVA updates while regression does not. A “recalculate” instruction does not suffice.

Nevertheless, developments in spreadsheets may render them suitable for some statistical work. I will try to suggest some appropriate applications.

2. MOTIVATIONS AND GOALS

My main objective is to encourage statisticians to learn where and how spreadsheets (indeed any software) may be appropriate in their work. Software developments, some outside statistics, offer potentially “safer” ways to use spreadsheets in statistical work. Where good statistical packages or well-constructed databases for data entry and edit are unavailable spreadsheets may prove useful. My message is harm reduction as opposed to abstinence. The developments, some incomplete, that inform my view on spreadsheet use in statistics involve

• improved statistical functions;
• audit trails of spreadsheet work; and
• improved data and program transfer (e.g., http://www.oasis-open.org).

These ideas offer potential benefits to statistical practitioners, especially because many of the ideas are being developed collaboratively with involvement of users.

3. IMPROVING SPREADSHEET FUNCTIONS

Computational “add-ins” to spreadsheets, especially Microsoft Excel, claiming to allow “correct” statistical computations to be performed include Analyse-It (www.analyse-it.com), UNISTAT (www.unistat.com), and Palisade Stattools (www.palisade.com/html/stattools.asp). Alternatively, RSvr is a freely available tool to allow Excel to use functions in the open-source R statistical package (cran.r-project.org/contrib/extra/dcom/).
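As an illustration of how a statistical function can be inaccurate even when its formula is algebraically correct, the following Python sketch compares two ways of computing a sample variance. It is a minimal example under stated assumptions: the function names and the data are invented for illustration, and the one-pass formula is used only as a well-known instance of a numerically unstable computation, not as a claim about how any particular spreadsheet or add-in is implemented.

```python
def variance_one_pass(xs):
    """Sample variance via the 'calculator' formula.

    Algebraically correct, but it subtracts two huge, nearly equal
    numbers, so most significant digits can cancel in floating point.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_sq = sum(x * x for x in xs)
    return (sum_sq - sum_x * sum_x / n) / (n - 1)


def variance_two_pass(xs):
    """Sample variance from deviations about the mean (more stable)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Hypothetical data: small spread around a large offset.
# The exact sample variance is 30.
data = [1_000_000_004.0, 1_000_000_007.0, 1_000_000_013.0, 1_000_000_016.0]

print("one-pass:", variance_one_pass(data))
print("two-pass:", variance_two_pass(data))
```

With these values the two-pass result is exactly 30.0, while the one-pass result is far from it because the squared terms exceed the precision of a double. A check of this kind, run on data with a known answer, is one simple way to test whether a statistical function can be trusted.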
