3 Initial data analysis - EMGO
Title of the document: Initial data analysis
Rev. Nr.: 1.1
Effective date: 1 Jan 2010
HB Nr.: 1.4-03 avs
1. Aim
To get a first impression of the data. To evaluate the randomisation procedure. To evaluate and,
where appropriate, impute missing values and outliers. To get an impression of the distributional
properties of the continuous variables and of the numbers in the subgroups. To explore the validity
and reliability of the measurement instruments.
2. Definitions
3. Keywords
Randomisation, missing values, distribution of continuous variables, subgroups, scale scores,
measurement level of variables.
4. Description
Exploratory analyses
This type of analysis helps in assessing whether there are missing values and/or outliers and
whether categories need to be combined.
First impression
It is advisable to always review the distribution of all the variables to be used. Frequencies are
reviewed for all categorical variables (e.g. marital status, education). Descriptive statistics
(percentage missing values, average, “trimmed” average, standard deviation, median, other
possible percentiles, minimum, maximum, skewness and kurtosis) are calculated for continuous
variables (e.g. body weight, blood pressure). It is advisable to create figures, e.g. boxplots or
histograms, in order to review the distribution.
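As an illustration of the descriptive statistics listed above, the following is a minimal sketch using only the Python standard library. The function name `describe`, the 10% trimming fraction and the example values are assumptions made for this sketch; skewness and kurtosis are computed with simple moment estimators, so statistical packages that apply a small-sample correction may report slightly different values.

```python
import statistics
from collections import Counter

def describe(values, trim=0.1):
    """Summarise one continuous variable; None marks a missing value."""
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    mean = statistics.mean(observed)
    k = int(n * trim)                 # observations cut from each tail
    trimmed = observed[k:n - k]
    # Central moments for skewness and (excess) kurtosis.
    m2 = sum((x - mean) ** 2 for x in observed) / n
    m3 = sum((x - mean) ** 3 for x in observed) / n
    m4 = sum((x - mean) ** 4 for x in observed) / n
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return {
        "pct_missing": 100 * (len(values) - n) / len(values),
        "mean": mean,
        "trimmed_mean": statistics.mean(trimmed),
        "sd": statistics.stdev(observed),
        "median": statistics.median(observed),
        "p25": q1,
        "p75": q3,
        "min": observed[0],
        "max": observed[-1],
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,
    }

# Hypothetical example data (body weight in kg, one value missing).
body_weight = [62.0, 71.5, 68.0, None, 80.2, 75.0, 66.3, 140.0, 70.1, 73.4, 69.5]
summary = describe(body_weight)

# Frequencies for a categorical variable, counting missing values as well.
marital_status = ["married", "single", "married", None, "widowed"]
frequencies = Counter(marital_status)
```

In this made-up sample the single value of 140 kg pulls the mean well above the trimmed mean and the median, which is the kind of signal that should prompt a look at a boxplot or histogram.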
Outliers
So-called outliers may occur in continuous variables. These are values that, theoretically, are not
“out of range”, but are extremely unlikely given the observed distribution. Reviewing averages and
standard deviations is not enough to discover outliers; a frequency table or boxplot will need to be
generated for this.
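The sketch below shows one common way to list the values that a boxplot would draw as separate points. It assumes the conventional rule of 1.5 times the interquartile range beyond the quartiles; that rule, the function name `boxplot_outliers` and the example values are assumptions for illustration, not something this guideline prescribes.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Return observed values lying beyond the boxplot whiskers."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in observed if v < lower or v > upper]

# Hypothetical systolic blood pressure values (mmHg).
systolic = [118, 122, 125, 130, 127, 121, 135, 119, 240, 128]
flagged = boxplot_outliers(systolic)

# For comparison: the mean plus three standard deviations.
mean = statistics.mean(systolic)
sd = statistics.stdev(systolic)
upper_3sd = mean + 3 * sd
```

In this made-up sample the value 240 is flagged by the boxplot rule but still lies below the mean plus three standard deviations, because the outlier itself inflates the standard deviation. Flagged values are candidates for checking against the raw data, not automatic deletions.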
Odd combinations
Cross-tabulations can be generated for categorical variables (e.g. gender x ADL limitations) in
order to assess whether odd combinations are present. Scatterplots can be created for
continuous variables to reveal any unlikely combinations (simply reviewing correlations is not
sufficient). For instance, the combination of a weight of 120 kg and a height of 1.50 metres will be
an outlier in most populations. When it has been decided that a certain value or combination of
values is an outlier and the true value cannot be recovered from the raw data, it needs to be
recoded as “missing”.
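A minimal sketch of both checks, assuming records are held as dictionaries. The variable names, the example records and the body mass index cut-off of 50 used to flag an unlikely weight and height combination are all hypothetical; in practice a scatterplot and the raw data would guide the decision.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "gender": "F", "adl_limited": "no", "weight_kg": 64.0, "height_m": 1.68},
    {"id": 2, "gender": "M", "adl_limited": "yes", "weight_kg": 82.5, "height_m": 1.80},
    {"id": 3, "gender": "F", "adl_limited": "no", "weight_kg": 120.0, "height_m": 1.50},
    {"id": 4, "gender": "M", "adl_limited": "no", "weight_kg": 77.0, "height_m": 1.75},
]

# Cross-tabulation of two categorical variables: count per combination.
crosstab = Counter((r["gender"], r["adl_limited"]) for r in records)

def is_odd_combination(record, max_bmi=50.0):
    """Flag weight/height pairs whose body mass index exceeds an assumed cut-off."""
    weight, height = record["weight_kg"], record["height_m"]
    if weight is None or height is None:
        return False
    return weight / height ** 2 > max_bmi

flagged_ids = [r["id"] for r in records if is_odd_combination(r)]

# Only once the raw data have been checked and the true values cannot be
# recovered: recode the flagged combination as missing.
for r in records:
    if r["id"] in flagged_ids:
        r["weight_kg"] = None
        r["height_m"] = None
```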
Missing values
Also carefully review missing values when evaluating the distributions. Often specific codes
(e.g. -1 or 9) are used for missing values. Note whether these codes have been defined as missing
values. If there are missing values, consider whether these need to be imputed (filled in). There
are a number of methods for this; please consult a statistician.
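The following sketch shows how missing-value codes could be defined per variable and recoded before any statistics are calculated. The variable names, the codes chosen for each variable and the helper names `recode_missing` and `pct_missing` are assumptions for illustration. Codes are defined per variable because a value such as 9 may be a missing-value code for one variable and a valid score for another.

```python
# Hypothetical missing-value codes, defined separately for each variable.
MISSING_CODES = {
    "marital_status": {9},
    "weight_kg": {-1},
}

def recode_missing(records, missing_codes):
    """Return copies of the records with missing-value codes replaced by None."""
    cleaned = []
    for record in records:
        row = dict(record)
        for variable, codes in missing_codes.items():
            if row.get(variable) in codes:
                row[variable] = None
        cleaned.append(row)
    return cleaned

def pct_missing(records, variable):
    """Percentage of records in which the variable is missing."""
    missing = sum(1 for r in records if r.get(variable) is None)
    return 100 * missing / len(records)

raw = [
    {"marital_status": 1, "weight_kg": 70.5},
    {"marital_status": 9, "weight_kg": 82.0},
    {"marital_status": 2, "weight_kg": -1},
    {"marital_status": 3, "weight_kg": 9},   # 9 kg is not a missing code here
]
clean = recode_missing(raw, MISSING_CODES)
```

If the codes are not recoded (or defined as missing in the statistical package), a value such as -1 silently enters the averages and standard deviations.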
Normal distribution
If a given analysis requires the variables to be normally distributed, it is
advisable to evaluate whether a variable is in fact normally distributed. Graphs can be used for
this, such as histograms or Q-Q plots. If it is apparent that the variable is not normally distributed,
then a transformation could be considered (for instance a logarithmic transformation) to see
whether this improves matters.
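As a sketch of this check, the code below computes the coordinates of a normal Q-Q plot and compares the skewness of a variable before and after a logarithmic transformation. The example values, the plotting positions (i + 0.5)/n and the simple moment estimator of skewness are assumptions for illustration; the points would normally be passed to a plotting tool and judged by eye.

```python
import math
from statistics import NormalDist, fmean

def skewness(values):
    """Moment estimator of skewness (0 for a symmetric distribution)."""
    m = fmean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def qq_points(values):
    """Pairs of (theoretical standard normal quantile, observed value)."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    return [(standard_normal.inv_cdf((i + 0.5) / n), x)
            for i, x in enumerate(ordered)]

# Hypothetical right-skewed, strictly positive variable.
raw = [1.2, 1.5, 1.9, 2.4, 3.1, 4.0, 5.6, 8.3, 13.5, 27.0]
logged = [math.log(x) for x in raw]   # only defined for values above zero

skew_raw = skewness(raw)
skew_log = skewness(logged)
points_raw = qq_points(raw)
points_log = qq_points(logged)
```

If the variable is approximately normal, the Q-Q points lie close to a straight line. In this made-up example the transformed values are clearly less skewed than the raw ones.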
Distribution of categories
Categories can be combined if the number of observations in one or more categories is too small.
The need for this is not always evident from an ordinary frequency distribution, but it can become
apparent from a cross-tabulation.
For instance, in a study with stratification by gender and education, the cross-tabulation
of education by gender shows that for men the lowest category “not completed primary education”
rarely occurs, whereas for women the highest category “completed university education” rarely
occurs. The lowest and next lowest categories can then be added together, as well as the highest
and second highest.
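The education example can be sketched as follows. The counts are invented for illustration, education is assumed to be coded from 1 (not completed primary education) to 5 (completed university education), and the names `crosstab` and `MERGE` are hypothetical.

```python
from collections import Counter

# Hypothetical sample of (gender, education code) pairs.
sample = (
    [("M", 1)] * 1 + [("M", 2)] * 14 + [("M", 3)] * 20 + [("M", 4)] * 15 + [("M", 5)] * 10
    + [("F", 1)] * 9 + [("F", 2)] * 16 + [("F", 3)] * 21 + [("F", 4)] * 13 + [("F", 5)] * 1
)

def crosstab(pairs):
    """Count observations per (gender, education) cell."""
    return Counter(pairs)

before = crosstab(sample)

# Combine the lowest with the next lowest category and the highest with
# the second highest; the middle category is left unchanged.
MERGE = {1: 2, 5: 4}
recoded = [(gender, MERGE.get(education, education)) for gender, education in sample]
after = crosstab(recoded)
```

The overall frequency of each education category (10 and 11 observations in the extreme categories here) would not reveal the problem; the nearly empty cells only show up once education is tabulated by gender.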
Evaluating the randomisation procedure
In order to evaluate whether the randomisation has been “successful”, the distribution of all the
relevant (prognostic) variables needs to be reviewed separately for each treatment arm.
Descriptive statistics (percentages, averages, median, standard deviation, range) can be used for
this. Differences between groups can be tested (e.g. chi-square or t-test), although it needs to be
remembered that, because of the randomisation procedure, any differences found are by definition
due to chance.
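A minimal sketch of descriptive statistics per treatment arm is given below. The participants, the variable names and the helper functions are hypothetical; only the descriptive part is shown, since a chi-square or t-test would normally be run in a statistical package.

```python
import statistics
from collections import Counter, defaultdict

# Hypothetical baseline data for two treatment arms.
participants = [
    {"arm": "intervention", "age": 54, "gender": "F"},
    {"arm": "intervention", "age": 61, "gender": "M"},
    {"arm": "intervention", "age": 47, "gender": "F"},
    {"arm": "intervention", "age": 58, "gender": "M"},
    {"arm": "control", "age": 52, "gender": "F"},
    {"arm": "control", "age": 63, "gender": "F"},
    {"arm": "control", "age": 49, "gender": "M"},
    {"arm": "control", "age": 60, "gender": "M"},
]

def values_by_arm(rows, variable):
    """Group the non-missing values of one variable by treatment arm."""
    groups = defaultdict(list)
    for row in rows:
        if row[variable] is not None:
            groups[row["arm"]].append(row[variable])
    return groups

def continuous_by_arm(rows, variable):
    """Mean, SD, median and range of a continuous variable per arm."""
    return {
        arm: {
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),
            "median": statistics.median(values),
            "range": (min(values), max(values)),
        }
        for arm, values in values_by_arm(rows, variable).items()
    }

def percentages_by_arm(rows, variable):
    """Percentage of each category of a categorical variable per arm."""
    return {
        arm: {category: 100 * count / len(values)
              for category, count in Counter(values).items()}
        for arm, values in values_by_arm(rows, variable).items()
    }

age = continuous_by_arm(participants, "age")
gender = percentages_by_arm(participants, "gender")
```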
Scale scores<br />
Prior to the items in a scale being summed, the way in which the items behave in the sample<br />
needs to be evaluated. The first step in this process is a frequency plot of the items in a scale.<br />
Usually there are “positive” and “negative” items. It may be necessary to reverse-score the<br />
positive or negative items prior to summing the items to a sum score, to ensure all items are<br />
scored in the same direction. The items can then be summed, possibly once the response<br />
categories have been combined (e.g. “very severe” and “severe”).<br />
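Reverse-scoring and summing can be sketched as follows, assuming items scored from 1 to 5. The item names, the choice of which items are reversed, and the rule that a respondent with any missing item receives no sum score are assumptions for illustration; the manual of the questionnaire in question may prescribe a different rule for missing items.

```python
def reverse_score(score, low=1, high=5):
    """Mirror a score within its range, e.g. 1 becomes 5 on a 1-5 item."""
    return None if score is None else low + high - score

def sum_score(responses, reversed_items, low=1, high=5):
    """Sum all items after reverse-scoring; None if any item is missing."""
    total = 0
    for item, score in responses.items():
        if score is None:
            return None
        if item in reversed_items:
            score = reverse_score(score, low, high)
        total += score
    return total

# Hypothetical four-item scale in which q2 and q4 are worded in the
# opposite direction to q1 and q3.
REVERSED = {"q2", "q4"}
complete = {"q1": 4, "q2": 2, "q3": 5, "q4": 1}
incomplete = {"q1": 4, "q2": None, "q3": 5, "q4": 1}

score_complete = sum_score(complete, REVERSED)
score_incomplete = sum_score(incomplete, REVERSED)
```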
The second step is a reliability and/or principal components analysis. The principal components or
factor analysis can be used to explore which items belong to which (sub)scales. Cronbach’s alpha
can be used to determine the internal consistency (homogeneity) of the scale. It is advisable to
always determine the Cronbach’s alpha for the scales and, if possible, a principal components
analysis (refer to the guideline Selecting, translating and validating questionnaires) to evaluate
whether the expected scales are also evident in the data.
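Cronbach’s alpha can be computed directly from the item variances and the variance of the sum score: alpha = k/(k - 1) × (1 - sum of item variances / variance of the sum score), where k is the number of items. The sketch below applies this standard formula to made-up data; the function name and the example responses are assumptions, and respondents with missing items are assumed to have been excluded beforehand.

```python
import statistics

def cronbach_alpha(rows):
    """Cronbach's alpha for complete responses (one list of item scores per respondent)."""
    k = len(rows[0])                       # number of items
    items = list(zip(*rows))               # one tuple of scores per item
    item_variances = [statistics.variance(scores) for scores in items]
    total_variance = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses of six respondents to a four-item scale,
# with all items already scored in the same direction.
responses = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [1, 2, 1, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
]
alpha = cronbach_alpha(responses)
```

Items must be reverse-scored where necessary before alpha is calculated; items scored in opposite directions lower alpha artificially.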
If it is apparent from the analyses that a given item does not fit the scale (e.g. the item-total
correlation is too low or it does not load adequately onto the principal component), then it needs
to be considered whether this item should be excluded from the sum score. Excluding an item does,
of course, have consequences for the comparability of scores with other studies and should
therefore be considered carefully. In general it is not advisable to modify frequently used scales.
It is better to use the original scales and report the results found (e.g. low alpha or low
item-total correlations) in the discussion of the article.
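One way to inspect how well each item fits is the corrected item-total correlation, i.e. the correlation between an item and the sum of the remaining items. The sketch below is an illustration under stated assumptions: the data are invented so that the fourth item does not fit, the function names are hypothetical, and no cut-off for “too low” is implied.

```python
import math

def pearson(x, y):
    """Pearson correlation; assumes neither variable is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item_rest_correlations(rows):
    """Correlation of each item with the sum of all other items."""
    k = len(rows[0])
    correlations = []
    for i in range(k):
        item = [row[i] for row in rows]
        rest = [sum(row) - row[i] for row in rows]
        correlations.append(pearson(item, rest))
    return correlations

# Hypothetical responses; the fourth item behaves differently from the others.
responses = [
    [3, 4, 3, 1],
    [2, 2, 3, 5],
    [4, 5, 4, 2],
    [1, 2, 1, 3],
    [5, 4, 5, 3],
    [3, 3, 2, 4],
]
correlations = item_rest_correlations(responses)
worst_item = correlations.index(min(correlations))   # 0-based position
```

The item itself is left out of the total so that it does not inflate its own correlation.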
5. Details<br />
Audit questions<br />
− Has the distribution of all the variables been reviewed?
− Were there variables with a high percentage of missing values? If so, how were these
dealt with?
− Have outliers been explored? If so, how?
− Have the cell numbers for central variables been taken into consideration?
− Where relevant: How were (large) deviations from normality dealt with?
− Has it been assessed whether the items belonging to a scale actually fit the scale?
6. Appendices/references/links<br />
7. Amendments<br />
V1.1: 1 Jan 2010: English translation.<br />
V1.0: 23 Apr 2007.