15.01.2015 Views

3 Initial data analysis - EMGO

3 Initial data analysis - EMGO

3 Initial data analysis - EMGO

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Title of the document:<br />

<strong>Initial</strong> <strong>data</strong> <strong>analysis</strong><br />

Page. 1 of 2<br />

Rev. Nr.: Effective date:<br />

1.1 1 jan 2010<br />

HB Nr. : 1.4-03 avs<br />

1. Aim<br />

To get a first impression of the <strong>data</strong>. Evaluating the randomisation procedures. Evaluating and<br />

potentially imputing missing values and outliers. To get an impression of the distribution properties<br />

of the continuous variables and the numbers in the subgroups. Exploration of the validity and<br />

reliability of the measurement instruments.<br />

2. Definitions<br />

3. Keywords<br />

Randomisation, missing values, distribution of continuous variables, subgroups, scale scores,<br />

measurement level of variables.<br />

4. Description<br />

Exploratory analyses<br />

This type of <strong>analysis</strong> helps in assessing whether there are missing values and/or outliers and<br />

whether categories need to be combined.<br />

First impression<br />

It is advisable to always review the distribution of all the variables to be used. Frequencies are<br />

reviewed for all categorical variables (e.g. marital status, education). Descriptive statistics<br />

(percentage missing values, average, “trimmed” average, standard deviation, median, other<br />

possible percentiles, minimum, maximum, skewness and kurtosis) are calculated for continuous<br />

variables (e.g. body weight, blood pressure). It is advisable to create figures, e.g. boxplots or<br />

histograms in order to review the distribution.<br />

Outliers<br />

So-called outliers may occur in continuous variables. These are values that, theoretically, are not<br />

“out of range”, but are extremely unlikely given the observed distribution. Reviewing averages and<br />

standard deviations is not enough to discover outliers; a frequency or boxplot will need to be<br />

generated for this.<br />

Odd combinations<br />

Cross-tabulations can be generated for categorical variables (e.g. gender x ADL limitations) in<br />

order to assess whether odd combinations are present. Scatterplots can be created for<br />

continuous variables to reveal any unlikely combinations (simply reviewing correlations is not<br />

sufficient). For instance: A weight of 120 kg and height of 1.50 metres will be an outlier in most<br />

populations. When it has been decided that a certain value or combination of values are outliers<br />

and the true value cannot be recovered from the raw <strong>data</strong>, then these need to be recoded as<br />

“missing”.<br />

Missing values<br />

Also carefully review missing values when evaluating the distributions. Often specific codes (e.g. -<br />

1 or 9) are used for missing values. Note whether these codes have been defined as missing<br />

values. If there are missing values, consider whether these need to be imputed (filled in). There<br />

are a number of methods for this: Please consult a statistician.<br />

Normal distribution<br />

If there is a requirement that the variables are normally distributed for a given <strong>analysis</strong>, it is<br />

advisable to evaluate whether a variable is in fact normally distributed. Graphs can be used for<br />

this, such as histograms or Q-Q plots. If it is apparent that the variable is not normally distributed,<br />

then a transformation could be considered (for instance a logarithm transformation) to see<br />

whether this improves matters.


Title of the document:<br />

<strong>Initial</strong> <strong>data</strong> <strong>analysis</strong><br />

Page. 2 of 2<br />

Rev. Nr.: Effective date:<br />

1.1 1 jan 2010<br />

HB Nr. : 1.4-03 avs<br />

Distribution of categories<br />

Categories can be combined if the numbers in one or more categories is/are too small. The need<br />

for this is not always evident from an ordinary frequency distribution. However, it can be apparent<br />

from a cross-tabulation.<br />

For instance in a study where there is stratification by gender and education, the cross-tabulation<br />

of education by gender shows that for men the lowest category “not completed primary education”<br />

rarely occurs, whereas for women the highest category “completed university education” rarely<br />

occurs. The lowest and next lowest categories can then be added together, as well as the highest<br />

and second highest.<br />

Evaluating the randomisation procedure<br />

In order to evaluate whether the randomisation has been “successful”, the distribution of all the<br />

relevant (prognostic) variables needs to be reviewed separately for each treatment arm.<br />

Descriptive statistics (percentages, averages, median, standard deviation, range) can be used for<br />

this. Differences between groups can be tested (e.g. chi-square or t-test), although it needs to<br />

remembered that due to the randomisation procedure any differences found are, by definition, due<br />

to chance.<br />

Scale scores<br />

Prior to the items in a scale being summed, the way in which the items behave in the sample<br />

needs to be evaluated. The first step in this process is a frequency plot of the items in a scale.<br />

Usually there are “positive” and “negative” items. It may be necessary to reverse-score the<br />

positive or negative items prior to summing the items to a sum score, to ensure all items are<br />

scored in the same direction. The items can then be summed, possibly once the response<br />

categories have been combined (e.g. “very severe” and “severe”).<br />

The second step is a reliability and/or principal components <strong>analysis</strong>. The principal components or<br />

factor <strong>analysis</strong> can be used to explore which items belong to which (sub)scales. Cronbach’s alpha<br />

can be used to determine the internal consistency (homogeneity) of the scale. It is advisable to<br />

always determine the Cronbach’s alpha for the scales and, if possible, a principal components<br />

<strong>analysis</strong> (refer to the guideline Selecting, translating and validating questionnaires) to evaluate<br />

whether the expected scales are also evident in the <strong>data</strong>.<br />

If it is apparent from the study that a given item does not fit the scale (e.g. the item-total<br />

correlation is too low or it does not load adequately onto the principal component), then whether<br />

this item should be excluded from the sum score needs to be considered. This does, of course,<br />

have consequences on the comparability of scores with other studies. This should therefore be<br />

considered carefully. In general it is not advisable to modify frequently used scales. It is better to<br />

use the original scales and report the results found (e.g. low alpha or low item-total correlations) in<br />

the discussion of the article.<br />

5. Details<br />

Audit questions<br />

− Has the distribution of all the variables been reviewed<br />

− Were there variables with a high percentage of missing values If so, how were these<br />

dealt with<br />

− Have outliers been explored If so, how<br />

− Have the cell numbers for central variables been taken into consideration<br />

− Where relevant: How were (large) deviations from normality solved<br />

− Has been assessed whether the items belonging to a scale actually fit to the scale<br />

−<br />

6. Appendices/references/links<br />

7. Amendments<br />

V1.1: 1 Jan 2010: English translation.<br />

V1.0: 23 Apr 2007.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!