Is meta-analysis all just ‘an exercise in mega-silliness’?


To my mind, there was something heroic about Gene Glass’ presidential address to the 1976 American Educational Research Association annual meeting. Prior to this, and dismayed by attacks on psychotherapy by psychologist Hans Eysenck, Glass and his colleague Mary Lee Smith manually searched over a thousand papers on the subject, identifying 375 to be suitable for their investigation into the effects of psychotherapy.

Working with often incomplete information from a multiplicity of sources, Glass applied his innovative statistical mind and, using a technique we now know as ‘meta-analysis’, found a sizeable treatment effect of psychotherapy; a result which has been confirmed by other researchers subsequently, but one derided at the time as ‘An Exercise in Mega-Silliness’ in a 1978 American Psychologist article written by the aforementioned Eysenck. Glass took a stand in San Francisco in 1976, and 43 years later the debate continues 5,000 miles away in Durham, UK.

Bringing order to mess and conflict

A decade after Glass’ AERA address, we find Larry Hedges – a rare breed of statistician – publishing ‘How Hard is Hard Science, How Soft is Soft Science?’ (Hedges, 1987). Among studies from both physics (the ‘hard’) and psychology (the ‘soft’), Hedges not only found meta-analytic techniques being used in both disciplines (albeit under different names), but also identified similar degrees of disagreement (heterogeneity) in the ‘hard’ and the ‘soft’ study findings. Hedges, like others working on research from medicine and education, saw the power of meta-analysis to bring order to the mess and conflict of myriad findings in a range of disciplines, as well as the limitations of the approach brought to life by Glass’ and Smith’s work.

Both the power and the limitations of meta-analysis must be understood if, today, it is to be used well by the increasing number of teachers, school leaders and policy-makers for whom it has become a component of decision-making (through engagement with summary databases such as the IES’ What Works Clearinghouse and the EEF’s Toolkit). In the field of psychology, much has been written about a so-called ‘replication crisis’ in which the reproducibility of published results has come into question (results which provide the very foundations of meta-analyses), and pressure has been applied to researchers to conduct more ‘exact’ replication studies. Now, as the spectre of this ‘crisis’ moves towards studies in education, how should we – as a community – respond?

Demand for more and better evidence

In schools and colleges around the world, teachers and leaders increasingly seek research evidence which is accessible, dependable, relevant to the students they teach, and which helps them make better decisions than they would be able to in its absence. That a growing number use the results of meta-analysis is not open to debate, and trying to stop the increasing application of research evidence in schools and colleges is fruitless. So how do we address a perceived, looming ‘crisis’ on the one hand, and a growing demand for more and better evidence on the other? I have four suggestions.

Firstly, let’s acknowledge that one potential cause of poor reproducibility of findings in education studies is treatment effect heterogeneity: the same treatment tested with different groups of students producing different results, even when common outcome measures are used appropriately. Most of the experiments we use in education are designed to find average treatment effects; rarely do they address the more interesting and important issue of detecting differences in treatment effects across different groups. So why not design this into our studies? Why not plan a sample which will help us understand both average treatment effects and answer the questions: why, for whom, and under what conditions? Research design and analysis should work in harmony, but often they don’t.
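To make the idea of heterogeneity concrete, here is a minimal sketch of how disagreement between studies is commonly quantified in a meta-analysis: an inverse-variance pooled effect, Cochran's Q and the I² statistic. The function name and the effect sizes below are hypothetical and chosen purely for illustration; they are not drawn from any study mentioned in this post.

```python
def heterogeneity(effects, variances):
    """Return (pooled effect, Cochran's Q, I^2) for a set of study results.

    effects   -- one effect size per study (e.g. standardised mean differences)
    variances -- the sampling variance of each effect size
    """
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))

    # I^2: share of total variation beyond what chance alone would
    # produce, floored at zero.
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


if __name__ == "__main__":
    # Hypothetical results: the same treatment tested with three groups.
    pooled, q, i2 = heterogeneity([0.10, 0.45, 0.30], [0.01, 0.02, 0.015])
    print(f"pooled effect = {pooled:.2f}, Q = {q:.2f}, I^2 = {i2:.0%}")
```

A high I² tells us that the studies disagree by more than sampling error would explain, but not why. Answering "for whom, and under what conditions?" still depends on a sample designed to support those comparisons.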

Secondly, all randomised controlled trials conducted in education should follow the CONSORT 2010 guidelines. The CONsolidated Standards Of Reporting Trials are intended to improve the quality of RCT reporting, “enabling readers to understand a trial’s design, conduct, analysis and interpretation, and to assess the validity of its results.” Alongside this, all trials in education (past, present and future) should be registered, and their full methods and summary results reported. At present, no such database for education trials exists.

Thirdly, if we are to continue using null hypothesis significance testing (and there are very good reasons not to in education research), let us adopt the recommendation (of some 80 researchers including Larry Hedges and E.-J. Wagenmakers) to “change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005” (Benjamin et al., 2018). While this is only one component of a much larger set of reforms I would like to see, it is one that could improve the reproducibility of findings by reducing the rate of false positives. That it propounds the continued use of NHST is problematic, but maybe we need to take this one step at a time.
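A rough piece of arithmetic shows why a stricter threshold reduces false positives. The sketch below estimates the share of 'significant' findings that are false, given a threshold, a level of statistical power, and the proportion of tested hypotheses that are actually true. The function name, the 80% power and the 1-in-10 prior are assumptions chosen for illustration, not figures taken from this post.

```python
def false_positive_risk(alpha, power, prior_true):
    """Share of statistically significant results that are false positives.

    alpha      -- significance threshold (e.g. 0.05 or 0.005)
    power      -- probability of detecting a real effect
    prior_true -- proportion of tested hypotheses that are actually true
    """
    false_pos = alpha * (1.0 - prior_true)  # null effects that pass the threshold
    true_pos = power * prior_true           # real effects that are detected
    return false_pos / (false_pos + true_pos)


if __name__ == "__main__":
    # Assumed for illustration: 80% power, 1 in 10 hypotheses true.
    for alpha in (0.05, 0.005):
        risk = false_positive_risk(alpha, power=0.8, prior_true=0.1)
        print(f"alpha = {alpha}: {risk:.0%} of significant results are false")
```

Note that this holds power constant. In practice, a stricter threshold at the same sample size lowers power, so larger samples would be needed to realise the full benefit.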

The burning questions

Finally, let us fund, design, execute and publish studies which address real problems faced in education, and answer the burning questions of teachers and school leaders. In doing so, let us harness the growing interest in evidence-informed practice in schools and colleges, adopt a user-focused design approach, and collaborate to create new ways of studying treatment effects so that the ensuing meta-analyses and systematic reviews which populate summary databases become more refined and more useful. Let us create new tools and processes to help networks of teachers and leaders work in collaboration with researchers to test theoretically-sound strategies and approaches in the classroom.


Evidence-informed classroom decisions require sound theory and high-quality evidence. Increasing numbers of teachers and leaders are engaging with research evidence in schools and colleges around the world, but the evidence base available to them is, itself, not yet fit-for-purpose. Many of the technical problems associated with meta-analysis have been known about for decades, yet they have not been addressed constructively and systematically.

And therein lies danger. We need to replicate Gene Glass’ innovative and problem-focused approach. We – the funders, researchers, methodologists, teachers, school leaders, publishers and policy-makers who each hold a small piece of the puzzle – need to take some risks. If, in another 40 years, we’re still talking about the same problems with trials and meta-analysis, the students who will enter our schools and colleges tomorrow, next year and in the ensuing decades will be justified in saying that we have let them down.

By setting aside ego and career goals, by leveraging technology and the power of networks, by introducing new and better incentives for early career researchers to learn about “the new statistics” (Cumming, 2014), and by increasing the quality, accessibility and transparency of research findings, we have opportunities today that didn’t exist for Gene Glass and Mary Lee Smith back in 1976. We should take them sooner rather than later; it would be mega-silly not to.


Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., ... & Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

Eysenck, H. J. (1978). An exercise in mega-silliness. American Psychologist, 33(5), 517.

Hedges, L. V. (1987). How hard is hard science, how soft is soft science? The empirical cumulativeness of research. American Psychologist, 42(5), 443.

About the author

Professor Stuart Kime (@ProfKime)

Stuart Kime is a qualified teacher who spent ten years teaching English and Drama in secondary schools. His interest in research focuses on assessment, teachers’ professional learning, and evaluation. At EBE, he is responsible for the design of all online and blended learning programmes.

Stuart is the author of the EEF’s Assessing and Monitoring Pupil Progress Guide, and co-author of the EEF’s DIY Evaluation Guide. He also wrote the National Toolkit of Common Evaluation Standards for Policing in the UK.

Formerly a Policy Fellow in the UK Government’s Department for Education, Stuart is now a Visiting International Professor in the Hector Research Institute for Education Sciences and Psychology at the Eberhard Karls University, Tübingen, and an Honorary Professor in the School of Education at Durham University.


Find out more:

Read more about meta-analysis on the CEM blog:

Systematic Reviews and Weather Forecasts – how purpose shapes the significance of systematic reviews for different education stakeholders, By Philippa Cordingley, Paul Crisp & Steve Higgins

Meta-analysis: Don’t do it or Do it more carefully? By Philippa Cordingley
