9 Missing Data - EMGO


10.07.2015 Views

Title of the document: Handling Missing Data
Page: 1 of 7
Rev. Nr.: 1.1
Effective date: 5-7-2012
HB Nr.: 1.4-09aw

1. Aim

To give researchers a structured guideline for handling missing data.

2. Definitions

3. Keywords

Missing data, Missing completely at random, Missing at random, Missing not at random, Imputation.

4. Description

4.1 Introduction

Missing data is a common problem in all kinds of research. The way you deal with it depends on how much data is missing, the kind of missing data (single items, a full questionnaire, a measurement wave), and why the data are missing. Handling missing data is an important step in several phases of your study.

4.2 Why do you need to do something with missing data?

The default option in SPSS is that cases with missing values are not included in the analyses. Deleting cases or persons results in a smaller sample size and larger standard errors. As a result, the power to find a significant result decreases: the chance that you correctly reject the null hypothesis of no effect in favour of the alternative hypothesis of an effect is smaller. Secondly, you can introduce bias in effect estimates, like mean differences (from t-tests) or regression coefficients (from regression analyses). When the group of non-responders is large and you delete them, your sample characteristics may differ from those of your original sample and of the population you study, because responders and non-responders could differ in their characteristics. Therefore, always inspect the missing data in your data set before starting your analyses, and never simply delete persons with missing values from your dataset (the default option in SPSS).

4.3 What to do with missing data in different phases of your study

Data preparation:
If you work with questionnaires, make sure that all questions are clear and applicable to your respondents. If necessary, use the ‘not applicable’ answer option. To decrease the chance of missing data, use digital applications to collect your data, such as Web based questionnaires where you can set the option that answering the question is required. You can also use these applications for sending reminders and tracking the respondents’ progress. If you work with physical or physiological data, the most frequent cause of missing data is a technical problem with the instruments. Testing the instruments in a pilot study will partly prevent these problems.

Data collection:
Closely monitor the completeness of the data when you receive or obtain the data. When you detect missing data during data collection, try to complete your data. Look back in the raw data (questionnaires), or ask your respondents to fill out the missing items. Describe in your logbook why data are missing. This helps you to decide whether data are missing at random or not.

Data processing:
Investigate how much data is missing (see 4.4), estimate the need for imputation, and think about the most adequate imputation method (see 4.5 and further).

Data analyses:
If you have missing values in your data set when starting your analyses, remember that casewise and listwise deletion (default in SPSS regression and ANOVAs) may hamper the reliability of your results (see 4.2).

4.4 How much data is missing?
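As a minimal sketch of the "Data processing" check (how much data is missing), the percentage missing per variable and the number of complete cases could be counted as follows in Python. The toy dataset, the variable names and the use of None as the missing-value marker are assumptions made for illustration only; they are not part of this guideline, which itself works with SPSS.

```python
# Hypothetical toy dataset: each dict is one respondent, None marks a missing value.
records = [
    {"age": 34, "income": 2100, "score": 7},
    {"age": 51, "income": None, "score": 5},
    {"age": 29, "income": 1800, "score": None},
    {"age": 62, "income": None, "score": None},
    {"age": 45, "income": 2500, "score": 6},
]


def missing_per_variable(rows):
    """Return {variable: percentage of rows in which it is missing}."""
    variables = rows[0].keys()
    return {
        var: 100.0 * sum(row[var] is None for row in rows) / len(rows)
        for var in variables
    }


def complete_case_count(rows):
    """Number of rows that would survive listwise deletion."""
    return sum(all(value is not None for value in row.values()) for row in rows)


per_variable = missing_per_variable(records)
n_complete = complete_case_count(records)

for var, pct in per_variable.items():
    print(f"{var}: {pct:.0f}% missing")
print(f"complete cases: {n_complete} of {len(records)}")
```

In this toy example two variables each miss 40% of their values, yet listwise deletion would keep only 2 of the 5 respondents, which illustrates why the loss of cases can be larger than the per-variable percentages suggest.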



Rubin developed a typology for missing data in 1976.

Type of missings: MCAR (Missing Completely At Random)
Description: The data are MCAR when the probability that a value for a certain variable is missing is unrelated to the values of other observed variables and unrelated to the variable with missing values itself. An example is when respondents accidentally skip questions. In other words, the observed values in your dataset are just a random sample of the values you would have had if your dataset had been complete.

Type of missings: MAR (Missing At Random; the case most of the time)
Description: The data are MAR when the probability that a value for a certain variable is missing is related to observed values on other variables. An example is when older respondents have more missing values than younger respondents. However, within each of the groups of older and younger respondents, the data are still MCAR. Another example is when respondents with low scores on the first wave are not invited for a second wave.

Type of missings: MNAR (Missing Not At Random)
Description: The data are MNAR when the probability that a value for a certain variable is missing is related to the scores on that variable itself. An example is that respondents with a low income intentionally skip the income question because they feel that reporting it violates their privacy. In that case, the probability that an observation is missing depends on information that is not observed, namely the value of the income score itself, because only low values are missing. MNAR is a serious problem, which cannot be solved with a technique such as multiple imputation.

How do you know what kind of missings you have?

There are three kinds of methods.

1. First, you can inspect the data yourself. Are the missings equally distributed in the data? Are low and/or high scores missing? If the missings are not equally spread, this might be an indication that the data are MNAR. With this method you must know a priori what the distribution of the variable normally is, i.e. is it normal or skewed? You need this information before you can judge which part of the data suffers from missing values. This method only applies if your dataset is large.

2. Second, SPSS can test whether the respondents with missing data differ from the respondents without missing data on important variables (Analyze – Missing Value Analysis – select important variables – Descriptives – t-test formed by indicator). Significant? Indication for MAR. Be aware that if your sample size is large (>500) this t-test might be significant even if the data truly are not MAR. So, just looking at the means and their difference might be good enough. In case this mean difference is very small, this might be an indication of MCAR.

3. In SPSS (Analyze – Missing Value Analysis, EM button), it is also possible to do a test for MCAR data. This is called Little's test. A tutorial on the Missing Value Analysis procedures in SPSS (SPSS 16 and later) can be found via the Help button.

It is important to note that you are not able to test whether your missing data is MAR or MNAR. The above mentioned procedures (1 and 2) will only give you an indication. Pay attention to the possibility of MNAR, because all analyses have serious problems when your missing data is MNAR.
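The idea behind method 2 (comparing respondents with and without missing data on another variable) can be sketched in Python. This is an illustration under stated assumptions, not the SPSS procedure itself: the data are simulated, the MAR mechanism (respondents aged 60 or older skip the income question more often) and its probabilities are invented, and the helper welch_t is a hypothetical name for a hand-written Welch t statistic.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical MAR mechanism: respondents aged 60+ skip the income question
# with probability 0.5, younger respondents with probability 0.1.
ages = [random.randint(20, 80) for _ in range(2000)]
income_missing = [random.random() < (0.5 if age >= 60 else 0.1) for age in ages]


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means of two groups."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se


# Split the observed variable (age) by the missingness indicator of income.
age_if_missing = [a for a, m in zip(ages, income_missing) if m]
age_if_observed = [a for a, m in zip(ages, income_missing) if not m]

mean_diff = statistics.mean(age_if_missing) - statistics.mean(age_if_observed)
t_stat = welch_t(age_if_missing, age_if_observed)

print(f"mean age difference (missing - observed): {mean_diff:.1f}")
print(f"Welch t statistic: {t_stat:.1f}")
```

A clearly positive mean difference here reflects the simulated mechanism: missingness on income is related to the observed variable age, which is the pattern the guideline describes as an indication for MAR. As noted above, such a comparison cannot distinguish MAR from MNAR.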


4.7 How to handle missing data?

Missing data is random:
For MCAR and MAR, many missing data methods have been developed in the last two decades (Schafer & Graham, 2002). Although MCAR seems to be the least problematic mechanism, deleting cases can still reduce the power of finding an effect. It is argued that the MAR mechanism is most frequently seen in practice. An argument for this is that most research studies multifactorial or multivariable problems, so when data on variables are missing, the missingness is mostly related to other variables in the dataset.

Missing data is not random:
For MNAR, imputation is not sufficient, because the missing data are totally different from the available data, i.e. the persons with complete data have become a selective group. If you think your data is MNAR it might be wise to contact a statistician from EMGO+ who is willing to help you.

For MCAR and MAR, there are roughly two kinds of techniques for imputation: single and multiple imputation.

Single imputation is possible in SPSS and is an easy way to handle missings when just a few cases are missing (less than 5%) and you think your missing values are MCAR or MAR. However, after single imputation the cases are more similar, which may result in an underestimation of the standard errors, i.e. smaller confidence intervals. This increases the chance of a type I error (the null hypothesis of no effect is rejected, while there is truly no effect). Therefore, this method is less adequate when you have >5% missing data.

Multiple imputation is more complex, but is also implemented in SPSS 17.0 and later versions. Multiple imputation takes into account the uncertainty of missing values (present in all values of variables) and is therefore preferred over single imputation. When your missingness is high (exceeds 5% in several variables and different persons) multiple imputation is more adequate.

Imputation techniques

Single imputation

Single imputation techniques are based on the idea that in a random sample every person can be replaced by a new person, given that this new person is randomly chosen from the same source population as the original person. In that case you can use the observed available data of the other persons to make an estimation of the distribution of the test result in the source population. It is called single imputation, because each missing value is imputed once.

There are many methods for single imputation, such as replacement by the mean, regression, and expectation maximization (EM). Expectation maximization is preferred, because in the other methods the variance and standard error are reduced and the chance of type I errors increases. Expectation maximization forms a missing data correlation matrix by assuming the shape of a distribution for the missing data and imputes missing values based on the likelihood under that distribution. Single imputation is possible in SPSS (Analyze – Missing Value Analysis – EM button, for expectation maximization). Contact a statistician from EMGO+ who is willing to help you with this procedure.

For the imputation of a missing score on a single item in a questionnaire (see 4.5), SPSS syntaxes can be found at:
http://www.tilburguniversity.edu/nl/over-tilburguniversity/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/
tw.zip: Software for two-way imputation in SPSS (Van Ginkel & Van der Ark, 2003a), or
rf.zip: Software for response function imputation in SPSS (Van Ginkel & Van der Ark, 2003b).
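The warning that single imputation makes cases more similar and underestimates the standard error can be illustrated with the simplest method, replacement by the mean. The following Python sketch uses invented scores with None as the missing-value marker; it only shows the shrinkage effect and is not the SPSS procedure.

```python
import statistics

# Hypothetical scores; None marks a missing value.
scores = [4.0, 7.0, None, 5.0, 9.0, None, 6.0, 8.0, None, 3.0]

observed = [s for s in scores if s is not None]
observed_mean = statistics.mean(observed)

# Single imputation by the mean: every missing value gets the same number.
mean_imputed = [observed_mean if s is None else s for s in scores]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(mean_imputed)

# Standard error of the mean, computed as if all values were really observed.
se_observed = sd_observed / len(observed) ** 0.5
se_imputed = sd_imputed / len(mean_imputed) ** 0.5

print(f"SD observed: {sd_observed:.2f}, SD after mean imputation: {sd_imputed:.2f}")
print(f"SE observed: {se_observed:.2f}, SE after mean imputation: {se_imputed:.2f}")
```

The imputed values add no spread but do raise the apparent sample size, so both the standard deviation and the standard error come out smaller than in the observed data. That is the mechanism behind the increased chance of a type I error.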


Multiple imputation (MI)

The difference with single imputation is that in MI each missing value is imputed several times, so that several imputed datasets are created. The different imputations are based on random draws from different estimates of the underlying distribution in the source population. In this way, the imputed values come from different distributions and are therefore less alike. This builds more uncertainty into the dataset, so the standard error increases. The number of imputations depends on the amount of missing data, but mostly 5 to 10 imputations are enough. A drawback of this method is that several imputed datasets are created and that the statistical analysis has to be repeated in each dataset. Finally, results have to be pooled into a summary measure. Most statistical packages can do this automatically. Multiple imputation is possible in recent versions of SPSS (version 17 and later) (Analyze – Multiple Imputation – Impute Missing Data Values). For more information see references. Contact a statistician from EMGO+ who is willing to help you with this procedure.

Sensitivity analysis

After imputation, sensitivity analysis is needed to determine how your substantive results depend on how you handled the missing data.

Follow these steps:
1. Do a complete case analysis (default option in SPSS; cases with missings are not included).
2. Do the same analysis on the data after imputation.
3. Compare substantive conclusions, and decide how to report.

When is imputation of missing data not necessary?

1) When your missing data is MCAR or MAR, and you use Maximum Likelihood estimation techniques in analyses such as Structural Equation Modelling (SEM) or Linear Mixed Models (LMM), imputation of missing data is not necessary. These techniques use the available data, ignore the missing values, and still give correct results. In such situations you do not have to use an extra imputation technique to handle your missing values. Missing data that are MNAR are still a problem for these methods.

2) A different approach may be used for descriptive studies. If you want to show the (observed) study data (means and standard deviations), for example to compare them with other countries/settings, without directly linking them to a conclusion, imputation is not immediately needed. However, evaluative statistics (t-tests, regressions, etc.) would certainly need complete data. So, if you use statistical tests to compare the descriptives, imputation is needed (of course depending on the amount and type of missing data). In this final case, you link your descriptives to a conclusion and want a corrected p-value / 95% CI, and therefore you need to use the data with imputed values. Do not forget the reviewer, who may sometimes have problems with the use of imputed and non-imputed data in one paper. Be clear about imputation and point out why you choose to present imputed/non-imputed data.

4. Summary

• Make every effort to avoid missing data, or failing that, to understand how much data is missing and why.
• Understand missing data mechanisms (MCAR, MAR, MNAR) and their implications.
• Avoid default methods (listwise deletion, pairwise deletion).
• Avoid default fixups (mean imputation, etc.) where possible.
• Use multiple imputation to take proper account of missings.
• Do a sensitivity analysis.
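The pooling step described under multiple imputation can be sketched in Python. The guideline does not name a pooling method; the sketch below assumes the commonly used Rubin's rules, in which the pooled estimate is the mean of the per-dataset estimates and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name pool_rubin and the example coefficients are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter over m imputed datasets with Rubin's rules.

    estimates: the parameter estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical regression coefficients and standard errors from 5 imputed datasets.
coefs = [0.52, 0.47, 0.55, 0.50, 0.46]
ses = [0.10, 0.11, 0.10, 0.12, 0.11]

pooled_coef, pooled_se = pool_rubin(coefs, [se ** 2 for se in ses])
print(f"pooled coefficient: {pooled_coef:.3f}, pooled SE: {pooled_se:.3f}")
```

The between-imputation term is what carries the extra uncertainty of the missing values into the pooled standard error, which is why the pooled standard error is larger than the average within-dataset standard error whenever the imputed datasets disagree.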


5. Details

I have missings in my data
    Try to complete your dataset
    What is the type of missing?
        MNAR
            Ask a statistician from EMGO+ to help you
        MCAR/MAR
            I use SEM or LMM for my analyses
                Use ML estimation, no imputation is needed
            I do not use SEM or LMM for my analyses. Imputation is needed.
                How much data is missing?
                    < 5%: use single imputation
                    > 5%: use multiple imputation
                Ask a statistician from EMGO+ to help you
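The decision flow above can also be written as a small function. This is only a sketch of the flowchart's logic in Python; the function name, argument names and return strings are invented for illustration, and the treatment of exactly 5% missing data is an assumption, because the flowchart only covers "< 5%" and "> 5%".

```python
def recommend_approach(mechanism, uses_sem_or_lmm, percent_missing):
    """Return the flowchart's recommendation as a short string.

    mechanism: "MCAR", "MAR" or "MNAR"
    uses_sem_or_lmm: True if the analysis is SEM or LMM with ML estimation
    percent_missing: percentage of missing data (0-100)
    """
    if mechanism == "MNAR":
        return "ask a statistician"
    if mechanism not in ("MCAR", "MAR"):
        raise ValueError(f"unknown mechanism: {mechanism!r}")
    if uses_sem_or_lmm:
        return "use ML estimation, no imputation needed"
    # Assumption: the flowchart does not say what to do at exactly 5%;
    # this sketch treats 5% and above as the multiple-imputation branch.
    if percent_missing < 5:
        return "single imputation"
    return "multiple imputation"


print(recommend_approach("MAR", False, 12.0))
print(recommend_approach("MCAR", True, 12.0))
```

The function does not replace the first step of the flowchart (trying to complete the dataset) or the advice to involve a statistician; it only encodes the branching on mechanism, analysis type and amount of missing data.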


6. Appendices/references/links

Multiple Imputation Methods, Niels Smits (technical literature).
http://www2.chass.ncsu.edu/garson/pa765/missing.htm
http://www.ssc.upenn.edu/~allison/MultInt99.pdf (especially for Multiple Imputation)

Ask EMGO+ statisticians for help via:
http://www.emgo.nl/kc/preparation/research%20design/3%20Advice%20and%20support.html

EMGO+ experts on Missing Data
Martijn Heymans: mw.heymans@vumc.nl
Jos Twisk: jwr.twisk@vumc.nl

Recommended (non-technical) literature.

1. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393. doi: 10.1136/bmj.b2393.
2. Allison, P.D. (2001). Missing Data (Sage University Papers Series on Quantitative Applications in the Social Sciences, series no. 07-136). Thousand Oaks: Sage.
3. Schafer, J.L. & Graham, J.W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.
4. Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006; 59(10):1087-91. Review.
5. http://www.stat.psu.edu/~jls/mifaq.html (Multiple Imputation FAQ page with explanation)
6. Van Ginkel, J. R., & Van der Ark, L. A. (2003a). SPSS syntax for two-way imputation of missing test data [computer software and manual]. Retrieved from http://www.tilburguniversity.edu/nl/over-tilburguniversity/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/
7. Van Ginkel, J. R., & Van der Ark, L. A. (2003b). SPSS syntax for response function imputation of missing test data [computer software and manual]. Retrieved from http://www.tilburguniversity.edu/nl/over-tilburguniversity/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/

7. Amendments

V1.0 1-12-2011
V1.1 5-7-2012: addition to section "When is imputation of missing data not necessary?"
