Would you let this test into your classroom?

By Professor Robert Coe

In England, the government has announced the end of using levels for assessment. If that means an end to meaningless numbers based on unstandardized, impressionistic, selective and biased judgements that fail to capture true learning, it is a good thing. But will it? And what have we got that is better?

As schools start to confront the reality of having to design their own assessment systems, or adopt them from elsewhere, two things have become clear to me. The first is that in assessment, quality matters. The difference between good and bad assessment is huge and it makes an important difference. The second is that the understanding of what makes one assessment good and another bad, and the ability to use that understanding to make good choices, seems to be very thinly spread.

Understanding quality of assessment is relevant to a range of educational choices, not just moving beyond levels. For example, as we await the government response to the consultation on Primary Accountability, understanding what kinds of assessment would be suitable to use as a baseline for children starting school is crucial. And assessment is not just something we do to students: recent debates about the value of lesson observation (I wrote about this in a previous blog) also depend on understanding how teaching can (and cannot) be assessed.

I have to confess to some self-interest here. As Director of a research centre that has been providing some of the highest quality assessments in the world for use by teachers for 30 years, I clearly want teachers to be able to appreciate quality assessment and make good choices. Unfortunately in recent years, at least among state schools in England, a range of pressures have meant that quality has not been a guarantee of popularity. I also teach on what I believe is an excellent Masters course in Educational Assessment (students say this too!). This year, for the first time, the cohort of students on the course contains not a single practising teacher in a UK school. People in professions other than teaching, or teaching in other countries, are interested in understanding quality in assessment; why not teachers here?

What follows is an attempt to define what it means to say an assessment is good, in the form of a 47-question checklist. Yes, it is long and contains some technical terms. If you don’t understand them, or why there have to be so many criteria for quality, perhaps you should sign up for an MSc in Educational Assessment? If you are a teacher, and considering using a particular assessment, these are the questions you should ask before you let it into your classroom.

Construct Definition

1. What does the assessment claim to measure?

2. What do the outcomes (scores, levels, grades, categories) tell you? How can outcomes be interpreted? Can you describe qualitatively the difference between a person with a higher score and one with a lower score?

3. What uses of (or decisions based on) these outcomes are appropriate/inappropriate?

4. If there are any subscales, how should they be interpreted?

5. If key performance thresholds are specified (eg pass/fail, basic/proficient/advanced), is it clear what they mean and how they are set?

6. How clearly defined are the acceptable interpretations and uses of assessment outcomes? Does the assessment provider (or creator) give explicit guidance on what interpretations are supported?

7. Is there appropriate guidance on likely, but unintended, incorrect interpretations of scores?

8. For what populations or contexts are these interpretations and uses claimed to be appropriate?

Construct validity

9. How is the construct expected to be related to other measures or constructs, according to theory?

10. To what extent are these expected relationships (or lack of relationships) confirmed by evidence?

11. How well does the measure correspond with measures of the same and related constructs, using the same and other methods of assessment? (eg from multi-trait multi-method analysis)

Content validity

12. Do the assessment items/tasks look appropriate?

13. Does the structure of the assessment and tasks look appropriate? (eg overall length, balance of different styles/modes of assessment)

14. Does the marking/scoring look appropriate?

15. Are the assessed behaviours all within the intended construct domain?

16. Could outcomes be influenced by any confounds or spurious characteristics/abilities?

17. Does what is assessed cover the full range of the intended construct domain, both in terms of content and level? Are there any gaps, that is, parts of the domain that are not assessed (or not reflected in assessment outcomes)?

Criterion-related validity

18. What (if anything) do assessment outcomes predict?

19. How well do the assessment outcomes predict later performance on valued outcomes? (eg on national or high stakes assessments such as Key Stage SATs, GCSE, GCE)

20. How well do the assessment scores correlate with other measures of the same thing?

21. For what samples (from specific populations/contexts) have these correlations been demonstrated? With what time intervals between measures? Are they appropriate? Adequate? (Do they match with q8?)

Reliability

22. Do repeated administrations of the assessment give consistent results? What test-retest correlations are reported? With what populations, contexts, samples and time interval between assessment? (Do they match with q8?)

23. To what extent is the assessment outcome dependent on the particular items included? What internal consistency measures are reported (eg Cronbach's alpha or Rasch person-reliability)? With what populations, contexts, samples? (Do they match with q8?)

24. If outcomes (or any part of them) are dependent on rater judgements, to what extent are they dependent on who judges? What inter-rater correlations (or other measures) are reported? With what populations, contexts, samples? (Do they match with q8?)

25. What sources of ‘random’ error might contribute to imprecision in the scores (eg assessment occasion, item selection, marker)? Are the relative sizes of their contribution to uncertainty in score estimates judged appropriately?

26. With what level of precision do assessment scores estimate the underlying trait? What estimates of the standard error of measurement are given? From what populations, contexts, samples? (Do they match with q8?)

27. If assessment performance categories are used to make different decisions or inferences, does the assessment discriminate adequately between these different levels of performance? What proportion of candidates would be correctly classified?
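
To make two of the technical terms above a little more concrete, here is a minimal sketch in Python of Cronbach's alpha (q23) and the standard error of measurement (q26). It is an illustration under stated assumptions, not the method any particular assessment provider uses: the function names and the small table of responses are invented for the example, alpha is computed with the standard textbook formula, and the SEM uses the classical approximation SD × √(1 − reliability).

```python
from math import sqrt
from statistics import stdev, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a table with one row per candidate, one column per item."""
    k = len(responses[0])                      # number of items
    item_columns = list(zip(*responses))       # transpose: one tuple per item
    item_variances = [variance(col) for col in item_columns]
    total_scores = [sum(row) for row in responses]
    return (k / (k - 1)) * (1 - sum(item_variances) / variance(total_scores))


def standard_error_of_measurement(total_scores, reliability):
    """Classical approximation: SD of observed scores times sqrt(1 - reliability)."""
    return stdev(total_scores) * sqrt(1 - reliability)


# Hypothetical data: 5 candidates, 4 items, each item scored 0 or 1.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

totals = [sum(row) for row in responses]
alpha = cronbach_alpha(responses)
sem = standard_error_of_measurement(totals, alpha)

print(f"alpha = {alpha:.2f}")
print(f"SEM   = {sem:.2f} score points")
# A rough 95% band around an observed score of 3 (assumes normally distributed error).
print(f"95% band for a score of 3: {3 - 1.96 * sem:.2f} to {3 + 1.96 * sem:.2f}")
```

With only five candidates and four items these figures mean very little in themselves; the point is that a reported alpha or SEM is only as good as the sample it was calculated from, which is why each question above asks whether the populations, contexts and samples match q8.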

Freedom from biases

28. Could the assessment unfairly favour some groups over others (such as gender, social class, race/ethnicity)?

29. Could assessment outcomes be confounded with factors other than the intended construct (eg reading ability, fluency in English, cultural knowledge, attitudes or dispositions)?

30. Is there a list of potential confounds (as in q29)? For each, what evidence is available to indicate whether assessment outcomes may be confounded? This might include differential predictive validity (ie coincidence of regression lines), DIF (differential item functioning) and basic descriptive statistics.

Range

31. For what ranges (age, abilities, etc) is the assessment appropriate?

32. Is it free from ceiling/floor effects?

33. How does the accuracy of the assessment (eg Information Function or Standard Error of Measurement) vary across the range?

Robustness

34. Is there any way candidates (or anyone else helping them) could cheat?

35. If the assessment has high stakes for test takers, teachers or others, could it be possible for them to achieve scores that would not genuinely reflect the construct being measured (as defined in the Construct Definition section)?

36. Could attaching high-stakes consequences to assessment outcomes affect any of the previous answers?

37. If the assessment might be used as the outcome measure in an impact evaluation, is the assessment 'objective', in the sense that it cannot be influenced by the expectations or desires of an 'unblinded' judge or assessor (ie one who knew which children had received which treatments)?

Educational value

38. Does the process of taking the assessment, or the feedback it generates, have direct value to teachers and learners?

39. Is it perceived positively by participants (eg students, teachers, parents)?

Accessibility

40. Is the assessment accessible to candidates with special needs? (eg visual impairment, deafness, dyslexia, physical disability, English as an additional language)

41. Are appropriate adaptations or guidance available for these needs?

Administration requirements

42. How long does the assessment (or each element of it) take each candidate?

43. Is any additional time required to set it up?

44. Does the assessment have to be invigilated or administered by a qualified/trained person?

45. Does it require computers or other technology/resources?

46. Can it be administered to groups or just individuals?

47. Do the responses have to be marked? How much time is needed for this?