A bit of assessment theory …
As I read Harry’s book I had this image in my head:

This is called an ‘Item Characteristic Curve’, or ICC, and it shows the relationship between a person’s ‘true’ ability (or knowledge, understanding, competence, etc – whatever the underlying construct is that you are trying to assess) and their probability of getting the item (i.e. the question) correct.
In a naïve understanding of assessment, students either know it or they don’t: if they do then they should get the question right, if they don’t then they shouldn’t. But the reality is captured by this smooth curve, the ICC, that shows how the probability gradually increases with ‘ability’. If your knowledge is such that you have an 80% chance of a correct answer, then one time in five you will get that question wrong.
So has that person ‘got it’ or not? The probability approaches one as your ability increases, but never actually reaches it, so no one is certain to get any question right. A more difficult question will have the same shaped curve, but shifted to the right.
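To make the shape of the curve concrete, here is a minimal sketch in Python. It assumes a one-parameter logistic (Rasch-style) form for the ICC, which is one common choice and not necessarily the model behind the curves shown here; the function name `icc` and the `ability` and `difficulty` scales are illustrative only.

```python
import math


def icc(ability: float, difficulty: float = 0.0) -> float:
    """Probability of a correct answer under an assumed one-parameter
    logistic model: it rises smoothly with (ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


if __name__ == "__main__":
    # Same abilities, two items: the harder item (difficulty=1.0) has the
    # same shaped curve, shifted to the right, so every probability is lower.
    for ability in (-2.0, 0.0, 2.0, 4.0):
        easier = icc(ability, difficulty=0.0)
        harder = icc(ability, difficulty=1.0)
        print(f"ability={ability:+.1f}  easier={easier:.3f}  harder={harder:.3f}")

    # Someone whose ability sits ln(4) above the item's difficulty has an
    # 80% chance of success, so they still get it wrong one time in five.
    print(f"{icc(math.log(4)):.2f}")
```

Under this assumed form the probability gets closer and closer to one as ability increases but never mathematically reaches it, which is the point made above.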
Here is an example of an Item Characteristic Curve from a real assessment:

On this graph:
- The horizontal axis shows the total score on the whole test (out of 72) for a sample of 7000 respondents.
- The blue shaded areas show the distribution of people who got this particular question right (at the top) and wrong (at the bottom).
- The height of each of the black diamonds shows the proportion of people with that total score who got this item correct.
- The green line shows what we would expect if the item fits our assessment model perfectly, and the green shading a 95% confidence interval around that expectation.
We can see that the observed data (the black diamonds, and the black line that is a smoothed trend through them) are almost exactly what we would expect. Diamonds that fall outside the confidence limits are coloured red.
We can also see that although this is not a hard item (overall, 67% got it right), even among the top scorers on the test some have got it wrong: the two blue distributions overlap quite a bit.
And we need some of those top scorers to get it wrong, and some of the low scorers to get it right, if the black line is to follow the green.
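The black diamonds described above can be computed from raw response data along the following lines. This is only a sketch, assuming every item is scored 0 or 1 and that each person's total is the sum of their item scores; the function name `observed_proportions` and the tiny response matrix are made up for illustration and are not the real assessment data.

```python
from collections import defaultdict


def observed_proportions(responses, item):
    """For each total score, return the proportion of people with that
    total who got the given item correct.

    responses: one list of 0/1 item scores per person (assumed format).
    item: index of the item of interest.
    """
    counts = defaultdict(lambda: [0, 0])  # total -> [n_correct, n_people]
    for person in responses:
        total = sum(person)
        counts[total][0] += person[item]
        counts[total][1] += 1
    return {total: correct / n for total, (correct, n) in sorted(counts.items())}


if __name__ == "__main__":
    # Illustrative data only: four people, three items.
    responses = [
        [1, 0, 0],
        [0, 1, 0],
        [1, 1, 0],
        [1, 1, 1],
    ]
    print(observed_proportions(responses, item=0))
```

Plotting these proportions against total score gives the observed points that are then compared with the model's expected curve.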
Task performance, inferences and decisions
Assessment is the process of capturing and scoring aspects of task performance in order to support inferences. Inferences are usually about a person and are time-bound. For example: ‘She doesn’t currently understand this’ or ‘He will make a good employee’. Decisions may be informed by those inferences; respectively, ‘I need to re-teach it’ or ‘Offer him the job’.
Given the multiplicity of the kinds of decisions or inferences an assessment may support over different time scales, I remain to be convinced that a binary distinction between ‘formative’ and ‘summative’ assessment is ever really helpful.
Here are some further differences between the ‘common sense’ view of assessment and what is actually seen when we look at assessment data:
| Common sense view | Reality |
| --- | --- |
| Learners either understand something or they don’t, in a binary sort of way | Understanding is best seen as a continuum, imperfectly observed |
| Once you ‘get it’ you never go back (cf threshold concepts) | There is no observable behaviour that corresponds to having ‘got it’, but what is observable is very erratic |
| The subjective feeling of ‘mastery’ is a good guide to learning | It just isn’t |
| Whether someone gets a question right or completes a task satisfactorily mostly depends on their knowledge, understanding, competence, etc | So many other factors affect responses and scores. Signal: noise ratio is often woeful |
Before explaining my answers to the questions I posed, I should say that I recognise that there are many reasons for asking questions in lessons that may not be for assessment – in the sense of supporting inferences or decisions.
We may use questions for retrieval practice or just to maintain attention; they may be a device to elicit students’ thinking or to provoke cognitive challenge or dialogue.
The questions I posed at the beginning are not intended as an assessment, but as a pedagogical device.
Answers to follow soon…