Why assessment may tell you less than you think – Part 1

By Rob Coe

I was inspired to write this blog by reading Harry Fletcher-Wood's book ‘Responsive Teaching’. I really enjoyed the book and would recommend it to any teacher who wants to make ‘formative assessment’ work in their classroom.

But it got me thinking about the limits of using single-question assessments, such as hinge questions and exit tickets (which Harry discusses at length), as a basis for decisions about what to teach to whom.

I have written before about the need for assessments to contain information, so what I say here builds on that. But before I tell you what I think, I’d like to know what you think. Below are five questions about these uses of assessment. Please do answer the first four, and your answers will be automatically captured.

How precise is a test score?

1. You have created a 20-item right-wrong test and given it to your class of 30 pupils. How far apart do two scores have to be before you can be confident one is really better than the other?

How much information is there in a ‘hinge question’?

2. You have taught this topic and seen a student working on it, and you estimate their probability of mastery at 80%. They then get the hinge question wrong. What is your new estimate of their probability of mastery of the concept?

How should you respond to an exit ticket?

3. You ask a question as an exit ticket at the end of a lesson. 25 of a class of 30 get it right. Next lesson, would you
4. If the number getting it right had been lower than 25, would you have done something different?
5. How much lower would it have to be to change your response?


A bit of assessment theory …

As I read Harry’s book I had this image in my head:


This is called an ‘Item Characteristic Curve’, or ICC, and it shows the relationship between a person’s ‘true’ ability (or knowledge, understanding, competence, etc – whatever the underlying construct is that you are trying to assess) and their probability of getting the item (ie question) correct.

In a naïve understanding of assessment, students either know it or they don’t: if they do then they should get the question right, if they don’t then they shouldn’t. But the reality is captured by this smooth curve, the ICC, that shows how the probability gradually increases with ‘ability’. If your knowledge is such that you have an 80% chance of a correct answer, then one time in five you will get that question wrong.

So has that person ‘got it’ or not? The probability approaches one as your ability increases, but never actually reaches it, so no one is certain to get any question right. A more difficult question will have the same shaped curve, but shifted to the right.
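
To make the curve concrete, here is a minimal sketch in Python. It assumes a simple one-parameter logistic (Rasch-type) model, which is one common way to draw an ICC but not necessarily the model behind any particular assessment. The function name `p_correct` and the values for ability and difficulty are purely illustrative, on an arbitrary scale.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under an assumed one-parameter
    logistic model: it depends only on (ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Ability at which this (hypothetical) item is answered correctly 80% of the time.
item_difficulty = 0.0
ability_80 = item_difficulty + math.log(0.8 / 0.2)

print(f"P(correct) at that ability: {p_correct(ability_80, item_difficulty):.2f}")

# A harder item has the same curve shifted right: lower probability at the same ability.
harder_item = 1.0
print(f"Same person, harder item:   {p_correct(ability_80, harder_item):.2f}")

# The probability climbs towards 1 as ability rises but stays below it.
for ability in (0.0, 2.0, 4.0, 6.0):
    print(f"ability={ability:>4}: P(correct)={p_correct(ability, item_difficulty):.4f}")
```

Under this assumed model, the person with an 80% chance of success is not someone who has ‘got it’ or not; they simply sit at a particular point on a smooth curve.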

Here is an example of an Item Characteristic Curve from a real assessment:

On this graph:

  • The horizontal axis shows the total score on the whole test (out of 72) for a sample of 7000 respondents.
  • The blue shaded areas show the distribution of people who got this particular question right (at the top) and wrong (at the bottom).
  • The height of each of the black diamonds shows the proportion of people with that total score who got this item correct.
  • The green line shows what we would expect if the item fits our assessment model perfectly, and the green shading a 95% confidence interval around that expectation.

We can see that the observed data (the black diamonds, and the black line that is a smoothed trend through them) are almost exactly what we would expect. Where a diamond falls outside the confidence limits, it is coloured red.

We can also see that although this is not a hard item (overall, 67% got it right), even among the top scorers on the test some have got it wrong: the two blue distributions overlap quite a bit.

And we need some of those top scorers to get it wrong, and some of the low scorers to get it right, if the black line is to follow the green.
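
A small simulation illustrates why this has to be so. It is a sketch under stated assumptions, not a reconstruction of the real data: it assumes a one-parameter logistic model, abilities drawn from a standard normal distribution, and a made-up item difficulty chosen so that roughly two thirds of respondents answer correctly. It also groups people by simulated ability rather than by total test score. All names (`p_correct`, `simulate_item`, `proportion_correct`) are hypothetical.

```python
import math
import random


def p_correct(ability: float, difficulty: float) -> float:
    """Assumed one-parameter logistic model for a single right-wrong item."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def simulate_item(n_people: int = 7000, difficulty: float = -0.9, seed: int = 0):
    """Return a list of (ability, answered_correctly) pairs for one item."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(n_people):
        ability = rng.gauss(0.0, 1.0)  # assumed ability distribution
        correct = rng.random() < p_correct(ability, difficulty)
        results.append((ability, correct))
    return results


def proportion_correct(results) -> float:
    """Share of respondents in `results` who got the item right."""
    return sum(1 for _, correct in results if correct) / len(results)


results = simulate_item()
results.sort(key=lambda pair: pair[0])  # order respondents by ability

fifth = len(results) // 5
bottom = results[:fifth]   # lowest-ability fifth
top = results[-fifth:]     # highest-ability fifth

print(f"Overall proportion correct: {proportion_correct(results):.2f}")
print(f"Top fifth who got it wrong: {1 - proportion_correct(top):.2f}")
print(f"Bottom fifth who got it right: {proportion_correct(bottom):.2f}")
```

If the model holds, a non-trivial share of the strongest respondents should still miss the item and a fair number of the weakest should still get it right. That is the overlap between the two blue distributions.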

Task performance, inferences and decisions

Assessment is the process of capturing and scoring aspects of task performance in order to support inferences. Inferences are usually about a person and are time-bound. For example: ‘She doesn’t currently understand this’ or ‘He will make a good employee’. Decisions may be informed by those inferences; respectively, ‘I need to re-teach it’ or ‘Offer him the job’.

Given the multiplicity of the kinds of decisions or inferences an assessment may support over different time scales, I remain to be convinced that a binary distinction between ‘formative’ and ‘summative’ assessment is ever really helpful.

Here are some further differences between the ‘common sense’ view of assessment and what is actually seen when we look at assessment data:

Common sense view | Reality
Learners either understand something or they don’t, in a binary sort of way | Understanding is best seen as a continuum, imperfectly observed
Once you ‘get it’ you never go back (cf threshold concepts) | There is no observable behaviour that corresponds to having ‘got it’, but what is observable is very erratic
The subjective feeling of ‘mastery’ is a good guide to learning | It just isn’t
Whether someone gets a question right or completes a task satisfactorily mostly depends on their knowledge, understanding, competence, etc | So many other factors affect responses and scores. Signal: noise ratio is often woeful


Before explaining my answers to the questions I posed, I should say that I recognise that there are many reasons for asking questions in lessons that may not be for assessment – in the sense of supporting inferences or decisions.

We may use questions for retrieval practice or just to maintain attention; they may be a device to elicit students’ thinking or to provoke cognitive challenge or dialogue.

The questions I posed at the beginning are not intended as an assessment, but as a pedagogical device.

Answers to follow soon…