3 Initial data analysis - EMGO

Title of the document: 

Initial data analysis 

Page. 1 of 2 

Rev. Nr.: Effective date: 

1.1 1 jan 2010 

HB Nr. : 1.4-03 avs 

1. Aim 

To get a first impression of the data. Evaluating the randomisation procedures. Evaluating and 

potentially imputing missing values and outliers. To get an impression of the distribution properties 

of the continuous variables and the numbers in the subgroups. Exploration of the validity and 

reliability of the measurement instruments. 

2. Definitions 

3. Keywords 

Randomisation, missing values, distribution of continuous variables, subgroups, scale scores, 

measurement level of variables. 

4. Description 

Exploratory analyses 

This type of analysis helps in assessing whether there are missing values and/or outliers and 

whether categories need to be combined. 

First impression 

It is advisable to always review the distribution of all the variables to be used. Frequencies are 

reviewed for all categorical variables (e.g. marital status, education). Descriptive statistics 

(percentage missing values, average, “trimmed” average, standard deviation, median, other 

possible percentiles, minimum, maximum, skewness and kurtosis) are calculated for continuous 

variables (e.g. body weight, blood pressure). It is advisable to create figures, e.g. boxplots or 

histograms in order to review the distribution. 

Outliers 

So-called outliers may occur in continuous variables. These are values that, theoretically, are not 

“out of range”, but are extremely unlikely given the observed distribution. Reviewing averages and 

standard deviations is not enough to discover outliers; a frequency or boxplot will need to be 

generated for this. 

Odd combinations 

Cross-tabulations can be generated for categorical variables (e.g. gender x ADL limitations) in 

order to assess whether odd combinations are present. Scatterplots can be created for 

continuous variables to reveal any unlikely combinations (simply reviewing correlations is not 

sufficient). For instance: A weight of 120 kg and height of 1.50 metres will be an outlier in most 

populations. When it has been decided that a certain value or combination of values are outliers 

and the true value cannot be recovered from the raw data, then these need to be recoded as 

“missing”. 

Missing values 

Also carefully review missing values when evaluating the distributions. Often specific codes (e.g. - 

1 or 9) are used for missing values. Note whether these codes have been defined as missing 

values. If there are missing values, consider whether these need to be imputed (filled in). There 

are a number of methods for this: Please consult a statistician. 

Normal distribution 

If there is a requirement that the variables are normally distributed for a given analysis, it is 

advisable to evaluate whether a variable is in fact normally distributed. Graphs can be used for 

this, such as histograms or Q-Q plots. If it is apparent that the variable is not normally distributed, 

then a transformation could be considered (for instance a logarithm transformation) to see 

whether this improves matters.

Title of the document: 

Initial data analysis 

Page. 2 of 2 

Rev. Nr.: Effective date: 

1.1 1 jan 2010 

HB Nr. : 1.4-03 avs 

Distribution of categories 

Categories can be combined if the numbers in one or more categories is/are too small. The need 

for this is not always evident from an ordinary frequency distribution. However, it can be apparent 

from a cross-tabulation. 

For instance in a study where there is stratification by gender and education, the cross-tabulation 

of education by gender shows that for men the lowest category “not completed primary education” 

rarely occurs, whereas for women the highest category “completed university education” rarely 

occurs. The lowest and next lowest categories can then be added together, as well as the highest 

and second highest. 

Evaluating the randomisation procedure 

In order to evaluate whether the randomisation has been “successful”, the distribution of all the 

relevant (prognostic) variables needs to be reviewed separately for each treatment arm. 

Descriptive statistics (percentages, averages, median, standard deviation, range) can be used for 

this. Differences between groups can be tested (e.g. chi-square or t-test), although it needs to 

remembered that due to the randomisation procedure any differences found are, by definition, due 

to chance. 

Scale scores 

Prior to the items in a scale being summed, the way in which the items behave in the sample 

needs to be evaluated. The first step in this process is a frequency plot of the items in a scale. 

Usually there are “positive” and “negative” items. It may be necessary to reverse-score the 

positive or negative items prior to summing the items to a sum score, to ensure all items are 

scored in the same direction. The items can then be summed, possibly once the response 

categories have been combined (e.g. “very severe” and “severe”). 

The second step is a reliability and/or principal components analysis. The principal components or 

factor analysis can be used to explore which items belong to which (sub)scales. Cronbach’s alpha 

can be used to determine the internal consistency (homogeneity) of the scale. It is advisable to 

always determine the Cronbach’s alpha for the scales and, if possible, a principal components 

analysis (refer to the guideline Selecting, translating and validating questionnaires) to evaluate 

whether the expected scales are also evident in the data. 

If it is apparent from the study that a given item does not fit the scale (e.g. the item-total 

correlation is too low or it does not load adequately onto the principal component), then whether 

this item should be excluded from the sum score needs to be considered. This does, of course, 

have consequences on the comparability of scores with other studies. This should therefore be 

considered carefully. In general it is not advisable to modify frequently used scales. It is better to 

use the original scales and report the results found (e.g. low alpha or low item-total correlations) in 

the discussion of the article. 

5. Details 

Audit questions 

− Has the distribution of all the variables been reviewed 

− Were there variables with a high percentage of missing values If so, how were these 

dealt with 

− Have outliers been explored If so, how 

− Have the cell numbers for central variables been taken into consideration 

− Where relevant: How were (large) deviations from normality solved 

− Has been assessed whether the items belonging to a scale actually fit to the scale 

− 

6. Appendices/references/links 

7. Amendments 

V1.1: 1 Jan 2010: English translation. 

V1.0: 23 Apr 2007.

3 Initial data analysis - EMGO

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?