Any good assessment is a balance between reliability and validity.
A lot has already been written about this. Dylan Wiliam discusses the complexity of the relationship between reliability and validity in Reliability, Validity and all that Jazz, where he refers to the ‘tension’ between them. Phil Stock’s blog series Principles of Great Assessment covers both ‘thorny and complicated’ areas in detail. Professor Rob Coe also covered the trade-off between reliability and validity in assessment design in his presentation at ResearchEd.
So, we’ve all heard of them and know that they are important, but what does it mean to say that an assessment is “reliable” or “valid”?
Reliability relates to repeatability and consistency. We can ask:
- If I repeated the same assessment, would I get the same results?
- If two students have similar ability in the subject, would they get similar results?
There are a number of ways we can measure reliability. Ideally, we would use more than one measure; if each one shows a high correlation, that is good evidence of a reliable assessment.
The most commonly used measure of reliability is a value called Cronbach’s alpha.
This is a measure of the internal reliability of an assessment. Cronbach’s alpha measures whether an assessment appears to be consistently measuring the same thing. It can be thought of as the average of all the possible split-half reliability estimates for the assessment: split the questions into two halves in every possible way and see how closely scores on one half track scores on the other.
Most statistics packages will calculate Cronbach’s alpha for you, and typically the result is reported on a scale much like a correlation, where 1 is the best you can get and a value near 0 is terrible.
(There are some rules of thumb as to what is a good measure. Even on Wikipedia.)
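If you would like to see what a statistics package is doing behind the scenes, here is a minimal sketch in Python. It uses the standard variance-based formula for Cronbach’s alpha rather than literally averaging split halves, and the marks are made up purely for illustration. The function name `cronbach_alpha` and the layout of the data (one row per student, one column per question) are assumptions for this example.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of marks.

    scores: one row per student, one column per question (item).
    """
    k = len(scores[0])                     # number of questions
    items = list(zip(*scores))             # transpose: one tuple per question
    item_var_sum = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Made-up marks: 5 students x 4 questions, each marked out of 5
marks = [
    [5, 4, 5, 4],
    [3, 3, 4, 3],
    [2, 1, 2, 2],
    [4, 4, 3, 4],
    [1, 2, 1, 1],
]

alpha = cronbach_alpha(marks)
print(f"Cronbach's alpha: {alpha:.2f}")
```

In this invented data the students who do well on one question tend to do well on the others, so alpha comes out close to 1.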
Another method that could be used is a test/retest reliability, comparing the results of people taking a test and then taking the same test again. A set of students could sit a test and then sit the same test two days later. The scores from the first test would then be compared with the scores of the second test. A high correlation between the two indicates a reliable test.
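As a rough sketch of that comparison, the Python below calculates the Pearson correlation between two sittings of the same test. The scores are invented, and the helper name `pearson` is just a label chosen for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up scores for the same six students, two days apart
first_sitting = [12, 18, 25, 9, 21, 15]
second_sitting = [14, 17, 24, 10, 22, 13]

r = pearson(first_sitting, second_sitting)
print(f"Test/retest correlation: {r:.2f}")
```

A value close to 1 means the students came out in much the same order, with similar gaps between them, on both occasions.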
Inter-rater reliability is the extent to which two or more raters (or observers, teachers, examiners) agree. It addresses the issue of consistency of the implementation of a particular rating system.
For example, if you have a test that requires marking by humans you could use inter-rater reliability. Ask at least two people to rate a set of students’ results and compare the results of each rater. If they tend to agree, then you have another reliability measure.
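One way to put a number on "tend to agree" is sketched below. It reports the simple proportion of scripts on which two raters gave the same grade, alongside Cohen's kappa, a statistic not mentioned above that adjusts that proportion for the agreement you would expect by chance. The grades and function names are invented for illustration.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of scripts on which both raters gave the same grade."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: how often the raters would match if each grade
    # were handed out at that rater's own overall rate
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up grades given by two teachers to the same eight scripts
rater_1 = ["A", "B", "B", "C", "A", "C", "B", "A"]
rater_2 = ["A", "B", "C", "C", "A", "C", "B", "B"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"Agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

Kappa will always be lower than the raw agreement figure, because some matches would happen even if the raters were guessing.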
A simple and traditional way of looking at validity is to ask: “Is the assessment testing what we think it is testing?” Validity is a hotly contested topic amongst academics at the moment and has been for many years.
The term ‘validity’ is increasingly being used to refer not to the function of the test itself, but to how the results are interpreted. As such, asking whether an assessment is valid is fruitless, because the results of the assessment will be valid for some purposes and not for others.
Given this debate, we can take a step back and look at some of the key aspects of validity that make it so important.
As teachers, we commonly look at an assessment and the questions it contains and judge it on how it looks. This is face validity.
For example, if it is a geography assessment, does it look like a geography assessment? Face validity is very commonly used, but isn’t very scientific because it can’t easily be measured.
Imagine that you have created a test that assesses a student’s ability to solve quadratic equations, and that the resulting order of marks matches the order of marks produced by someone else’s test of quadratic equations. This is concurrent validity.
Concurrent validity is the degree to which the results of one test correlate with other tests that measure the same thing.
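Because the quadratic equations example is about whether the order of marks matches, a rank correlation is a natural way to check it. The sketch below uses Spearman's rank correlation with the shortcut formula that assumes no two students share the same mark on a test. The marks and function names are made up for illustration.

```python
def ranks(values):
    """Rank each value from 1 (lowest) to n (highest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_no_ties(x, y):
    """Spearman rank correlation using the no-ties shortcut formula."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up marks for the same six students on two quadratic equations tests
my_test = [34, 50, 41, 22, 47, 29]
other_test = [60, 81, 79, 45, 88, 52]

rho = spearman_no_ties(my_test, other_test)
print(f"Rank correlation: {rho:.2f}")
```

Here only two students swap places between the two tests, so the rank correlation is high. If marks are tied, a statistics package will handle the ranking properly.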
Sometimes we want an assessment to be an accurate prediction of future achievement. Many of the assessments produced by CEM, such as MidYIS and Alis, are designed to have a high degree of predictive validity.
An assessment that is designed to predict how a student will perform, for example at the end of Key Stage 2, or GCSE, or A level, is said to have predictive validity.
Predictive validity means that it is possible to make an educated guess at the most likely outcome, in terms of the grade/level, a student will achieve in the future based on their performance now.
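To give a feel for what an educated guess might look like in numbers, here is a deliberately simple sketch: a straight line fitted by least squares through past students' baseline scores and their later grades, then used to estimate a new student's grade. This is only an illustration with invented data and names. It is not a description of how MidYIS, Alis or any other CEM assessment produces its predictions.

```python
def fit_line(x, y):
    """Least-squares straight line through the points; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: baseline scores now, and the grades those students later achieved
baseline_scores = [85, 95, 100, 110, 120]
later_grades = [3, 4, 5, 6, 7]

slope, intercept = fit_line(baseline_scores, later_grades)

# Estimate the most likely grade for a new student with a baseline score of 105
predicted = slope * 105 + intercept
print(f"Predicted grade: {predicted:.1f}")
```

The closer past students' results sit to the line, the more confidence you can place in the estimate.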
In an ideal world
Assessments can be reliable but not valid, but they can’t be valid without also being reliable.
In an ideal world, an assessment should be both highly reliable and highly valid. Admittedly, this is difficult to achieve due to the tension that exists between the two.
However, many of the measures used in creating a good assessment don’t require a high degree of mathematical knowledge or understanding, just the ability to see how well two sets of numbers relate to each other.
Find out more about what makes a good assessment in Rob Coe’s blog posts:
Reporting the evidence: what research can tell us about how assessment data is used
Why assessment may tell you less than you think – Part 1
Why assessment may tell you less than you think – Part 2