What is an ‘Effect Size’?
A Brief Introduction
Robert Coe, CEM Centre, Durham University
‘Effect Size’ is simply a way of quantifying the difference between two groups. For example, if one group has had an ‘experimental’ treatment and the other has not (the ‘control’), then the Effect Size is a measure of the effectiveness of the treatment.
Effect Size uses the idea of ‘standard deviation’ to contextualise the difference between the two groups. Standard deviation is a measure of how spread out a set of values is. Various formulae for calculating it can be found in any statistics textbook, or if data are entered into a spreadsheet such as Excel, a built-in formula can be used.
Alternatively, standard deviation can be interpreted graphically. Many datasets have a distribution similar to that shown in Figure 1. For these, the standard deviation is the distance you have to go either side of the mean (average) in order to include 68% of the population. If you go twice this distance (two standard deviations), then you can expect to include 95% of the population.
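The 68% and 95% figures can be checked numerically. The following is a minimal sketch that assumes the data follow an exact Normal distribution; the function name share_within is chosen here for illustration.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist()

def share_within(k: float) -> float:
    """Proportion of a Normal population lying within k SDs of the mean."""
    return STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k)

within_one_sd = share_within(1.0)  # roughly 0.68
within_two_sd = share_within(2.0)  # roughly 0.95
print(f"{within_one_sd:.0%} within 1 SD, {within_two_sd:.0%} within 2 SDs")
```

Real datasets are only approximately Normal, so the proportions observed in practice will differ a little from these values.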
The Effect Size is just the difference between the mean values of the two groups, divided by the standard deviation (Equation 1).
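Written in symbols, this definition takes the form below. The symbol names are labels chosen here rather than notation from the original: \bar{x}_E and \bar{x}_C stand for the mean scores of the experimental and control groups, and s for the standard deviation. The text does not say at this point which standard deviation (for example, that of the control group or a pooled value) is intended.

```latex
\[
\text{Effect Size} = \frac{\bar{x}_{E} - \bar{x}_{C}}{s}
\]
```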
Consider an experiment conducted by Val Dowson to investigate time of day effects on learning: do children learn better in the morning or afternoon? A group of 38 children were included in the experiment. Half were randomly allocated to listen to a story and answer questions about it at 9am, the other half to hear exactly the same story (on tape) and answer the same questions at 3pm. Their comprehension was measured by the number of questions answered correctly out of 20.
The average score was 15.2 for the morning group, 17.9 for the afternoon group: a difference of 2.7. But how big a difference is this? If the outcome were measured on a familiar scale, such as GCSE grades, interpreting the difference would not be a problem. If the average difference were, say, half a grade, most people would have a fair idea of the educational significance of the effect of reading a story at different times of day. However, in many experiments there is no familiar scale available on which to record the outcomes. The experimenter often has to invent a scale or to use (or adapt) an already existing one – but generally not one whose interpretation will be familiar to most people.
Using Effect Size helps to overcome this difficulty, since if we know the spread of scores (ie the standard deviation), it will help us to put the difference into context. In Dowson’s time-of-day effects experiment, the standard deviation (SD) = 3.3, so the Effect Size was (17.9 – 15.2)/3.3 = 0.8.
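The same arithmetic can be sketched in a few lines of Python. The function and variable names are chosen here for illustration; the numbers are those quoted for the time-of-day experiment, with the afternoon group treated as the ‘experimental’ group.

```python
def effect_size(mean_experimental: float, mean_control: float, sd: float) -> float:
    """Difference between two group means, expressed in standard deviation units."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean_experimental - mean_control) / sd

afternoon_mean = 17.9  # average score, afternoon group
morning_mean = 15.2    # average score, morning group
sd = 3.3               # standard deviation quoted in the text

d = effect_size(afternoon_mean, morning_mean, sd)
print(round(d, 1))  # 0.8
```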
Interpreting Effect Sizes
Provided our data have the kind of distribution shown in Figure 1 (a ‘Normal’ distribution), we can readily interpret Effect Sizes in terms of the amount of overlap between the two groups.
For example, an effect size of 0.8 means that the score of the average person in the experimental group exceeds the scores of 79% of the control group. If the two groups had been classes of 25, the average person in the ‘afternoon’ group (ie the one who would have been ranked 13th in the group) would have scored about the same as the 6th highest person in the ‘morning’ group. Visualising these two individuals can give quite a graphic interpretation of the difference between the two effects.
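These figures can be reproduced approximately from the Normal distribution. The sketch below assumes both groups are exactly Normal with the same spread; the function names and the rounding rule used for ranks are assumptions made here, since the text does not state how the ranks were derived.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def percent_of_control_below(effect_size: float) -> float:
    """Percentage of the control group scoring below the average
    person in the experimental group."""
    return 100.0 * STANDARD_NORMAL.cdf(effect_size)

def equivalent_rank(effect_size: float, group_size: int) -> int:
    """Approximate rank (1 = highest) that the average experimental-group
    person would hold in a control group of the given size."""
    share_above = 1.0 - STANDARD_NORMAL.cdf(effect_size)
    # Count the whole number of people expected to score higher, then add one.
    return math.floor(share_above * group_size) + 1

print(round(percent_of_control_below(0.8)))  # 79
print(equivalent_rank(0.8, 25))              # 6
print(equivalent_rank(0.0, 25))              # 13
```

With an effect size of zero the average person stays 13th out of 25, which matches the starting rank used in the example above.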
Table 1 shows conversions of effect sizes to percentiles (column 2) and the equivalent change in rank for the average person in a group of 29 (column 3). Notice that an effect size of 1.6 would raise the average person to be level with the top-ranked individual in the control group, so effect sizes larger than this are illustrated in terms of the top person in a larger group. For example, an effect size of 3.0 would bring the average person in a group of 740 level with the previously top person in the group.
Table 1 (columns 2 and 3)
Column 2: Percentage of control group who would be below the average person in the experimental group
Column 3: Rank of the person in a control group of 29 who would be equivalent to the average person in the experimental group
Column 3 entries for the largest effect sizes: 1 (or 1st out of 44); 1 (or 1st out of 160); 1 (or 1st out of 740)
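The ‘1st out of N’ figures can be approximated by asking how large a Normally distributed group must be before about one person is expected to score above a given effect size. This is a rough sketch under that assumption, not necessarily the method used to build Table 1. The function name is chosen here, and the effect sizes 2.0 and 2.5 are illustrative inputs; only the pairing of 3.0 with 740 is stated in the text.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def group_size_for_top_rank(effect_size: float) -> float:
    """Group size at which roughly one member is expected to score
    more than effect_size standard deviations above the mean."""
    share_above = 1.0 - STANDARD_NORMAL.cdf(effect_size)
    return 1.0 / share_above

for es in (2.0, 2.5, 3.0):
    # Prints values of roughly 44, 161 and 741 respectively.
    print(es, round(group_size_for_top_rank(es)))
```

The results are close to the 44, 160 and 740 quoted in the table, which appear to have been rounded.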
Another way to interpret effect sizes is to compare them to the effect sizes of differences that are familiar. For example, an effect size of 0.2 corresponds to the difference between the heights of 15-year-old and 16-year-old girls in the US. An effect size of 0.5 corresponds to the difference between the heights of 14-year-old and 18-year-old girls. An effect size of 0.8 equates to the difference between the heights of 13-year-old and 18-year-old girls.
The distributions of GCSE grades in compulsory subjects (ie Maths and English) have standard deviations of between 1.5 and 1.8 grades, so an improvement of one GCSE grade represents an effect size of 0.5 – 0.7. In the context of a secondary school, therefore, introducing a change in practice whose effect size was known to be 0.6 would be likely to result in an improvement of about a GCSE grade for each pupil in each subject. For a school in which 50% of pupils were previously gaining five or more A* – C grades, this percentage (other things being equal, and assuming that the effect applied equally across the whole curriculum) would rise to 73%. Even what would generally be considered a ‘small’ effect of 0.2 would produce an increase from 50% to 58% – a difference that most schools would probably categorise as quite substantial.
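The rise from 50% to 73% (or to 58%) follows from shifting a Normal distribution upwards by the effect size while the threshold stays fixed. The sketch below assumes that pupils' results are Normally distributed and that every pupil improves by the same amount; the function name is chosen here for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def new_pass_rate(old_rate: float, effect_size: float) -> float:
    """Share of pupils above a fixed threshold after every score
    rises by effect_size standard deviations."""
    # Position of the threshold, in SD units, before the change.
    threshold_z = STANDARD_NORMAL.inv_cdf(1.0 - old_rate)
    # After the shift the threshold sits effect_size SDs lower in relative terms.
    return 1.0 - STANDARD_NORMAL.cdf(threshold_z - effect_size)

print(round(100 * new_pass_rate(0.50, 0.6)))  # 73
print(round(100 * new_pass_rate(0.50, 0.2)))  # 58
```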