Adaptive comparative judgement

Adaptive comparative judgement is a technique borrowed from psychophysics that can generate reliable results for educational assessment; as such, it is an alternative to traditional exam script marking. In this approach, judges are presented with pairs of student work and asked to choose which of the two is better. By means of an iterative and adaptive algorithm, a scaled distribution of student work can then be obtained without reference to criteria.

Introduction
Traditional exam script marking began in Cambridge in 1792 when, with undergraduate numbers rising, the importance of proper ranking of students was growing. In that year the new Proctor of Examinations, William Farish, introduced marking, a process in which every examiner gives a numerical score to each response by every student, and the overall total mark puts the students in the final rank order. Francis Galton (1869) noted that, in an unidentified year about 1863, the Senior Wrangler scored 7,634 out of a maximum of 17,000, while the Second Wrangler scored 4,123. (The 'Wooden Spoon' scored only 237.)

Prior to 1792, a team of Cambridge examiners convened at 5pm on the last day of examining, reviewed the 19 papers each student had sat, and published their rank order at midnight. Marking solved the problem of rising numbers and prevented unfair personal bias, and its introduction was a step towards modern objective testing, the format it is best suited to. But the technology of testing that followed, with its major emphasis on reliability and the automatisation of marking, has been an uncomfortable partner for some areas of educational achievement: assessing writing or speaking, and other kinds of performance, needs something more qualitative and judgemental.

The technique of Adaptive Comparative Judgement is an alternative to marking. It returns to the pre-1792 idea of sorting papers according to their quality, but retains the guarantee of reliability and fairness. It is by far the most reliable way known to score essays or more complex performances. It is much simpler than marking, and has been preferred by almost all examiners who have tried it. The real appeal of Adaptive Comparative Judgement lies in how it can re-professionalise the activity of assessment and how it can re-integrate assessment with learning.

Thurstone's law of comparative judgement
"There is no such thing as absolute judgement"

The science of comparative judgement began with Louis Leon Thurstone of the University of Chicago. A pioneer of psychophysics, he proposed several ways to construct scales for measuring sensation and other psychological properties. One of these was the law of comparative judgment (Thurstone, 1927a, 1927b), which defined a mathematical way of modeling the chance that one object will 'beat' another in a comparison, given values for the 'quality' of each. This is all that is needed to construct a complete measurement system.

A variation on his model (see Pairwise comparison and the BTL model) states that the difference between the quality values of two objects is equal to the log of the odds that object A will beat object B:



$$ \mathrm{log\;odds}(A\ \text{beats}\ B\mid v_a,v_b)=v_a-v_b $$
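As an illustration, the log-odds relation can be turned into a probability by inverting the logit. The following is a minimal sketch in Python; the function name `win_probability` is chosen here for illustration and does not come from any of the cited sources.

```python
import math

def win_probability(v_a: float, v_b: float) -> float:
    """Probability that object A beats object B when the
    log odds of that outcome equal v_a - v_b."""
    return 1.0 / (1.0 + math.exp(-(v_a - v_b)))

# Equal quality values give an even chance.
print(win_probability(1.0, 1.0))            # 0.5
# A one-unit advantage gives odds of e to 1 in favour of A.
print(round(win_probability(2.0, 1.0), 3))  # 0.731
```

Only the difference between the two values matters, so adding the same constant to every value leaves all the probabilities unchanged.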

Before the availability of modern computers, the mathematics needed to calculate the 'values' of each object's quality meant that the method could only be used with small sets of objects, and its application was limited. For Thurstone, the objects were generally sensations, such as intensity, or attitudes, such as the seriousness of crimes, or statements of opinions. Social researchers continued to use the method, as did market researchers for whom the objects might be different hotel room layouts, or variations on a proposed new biscuit.
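To give a sense of the calculation involved, the sketch below estimates quality values from a table of comparison outcomes under the log-odds model above. It uses a standard iterative fitting scheme for the Bradley–Terry–Luce model (often described as a minorisation-maximisation or Zermelo iteration). This is an assumption made for illustration, not a description of the procedure used by Thurstone or by any particular system, and the function name `estimate_values` is hypothetical.

```python
import math

def estimate_values(wins, n_iter=200):
    """Estimate log-scale quality values from pairwise outcomes.

    wins[i][j] is the number of times object i beat object j.
    Assumes every object has at least one win and one loss;
    otherwise its estimate is not finite.
    Returns values centred so that they sum to zero.
    """
    n = len(wins)
    strength = [1.0] * n  # strength[i] plays the role of exp(v_i)
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Comparisons involving i, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom)
        # Fix the arbitrary origin: geometric mean of strengths = 1.
        log_mean = sum(math.log(s) for s in updated) / n
        strength = [s / math.exp(log_mean) for s in updated]
    return [math.log(s) for s in strength]

# Three objects, each pair compared four times (made-up counts).
outcomes = [[0, 3, 4],
            [1, 0, 3],
            [0, 1, 0]]
print([round(v, 2) for v in estimate_values(outcomes)])
```

Each pass touches every pair of objects, and many passes are needed, which suggests why the method was impractical by hand for more than a small set of objects.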

In the 1970s and 1980s, comparative judgement appeared, almost for the first time in educational assessment, as a theoretical basis or precursor for the new Latent Trait or Item Response Theories (Andrich, 1978). These models are now standard, especially in item banking and adaptive testing systems.

Re-introduction in education
The first published paper using Comparative Judgement in education was Pollitt & Murray (1994), essentially a research paper concerning the nature of the English proficiency scale assessed in the speaking part of Cambridge's CPE exam. The objects were candidates, represented by 2-minute snippets of video recordings from their test sessions, and the judges were Linguistics post-graduate students with no assessment training. The judges compared pairs of video snippets, simply reporting which they thought the better student, and were then clinically interviewed to elicit the reasons for their decisions.

Pollitt then introduced Comparative Judgement to the UK awarding bodies, as a method for comparing the standards of A Levels from different boards. Comparative judgement replaced their existing method which required direct judgement of a script against the official standard of a different board. For the first two or three years of this Pollitt carried out all of the analyses for all the boards, using a program he had written for the purpose. It immediately became the only experimental method used to investigate exam comparability in the UK; the applications for this purpose from 1996 to 2006 are fully described in Bramley (2007).

In 2004, Pollitt presented a paper at the conference of the International Association for Educational Assessment titled Let's Stop Marking Exams, and another at the same conference in 2009 titled Abolishing Marksism. In each paper the aim was to convince the assessment community that there were significant advantages to using Comparative Judgement in place of marking for some types of assessment. In 2010 he presented a paper at the Association for Educational Assessment – Europe, How to Assess Writing Reliably and Validly, which presented evidence of the extraordinarily high reliability that has been achieved with Comparative Judgement in assessing primary school pupils' skill in first-language English writing.

Adaptive comparative judgement
Comparative judgement becomes a viable alternative to marking when it is implemented as an adaptive web-based assessment system. In such a system, the 'scores' (the model parameter for each object) are re-estimated after each 'round' of judgements in which, on average, each object has been judged one more time. In the next round, each script is compared only to another whose current estimated score is similar, which increases the amount of statistical information contained in each judgement. As a result, the estimation procedure is more efficient than random pairing, or any other pre-determined pairing system like those used in classical comparative judgement applications (Pollitt, 2012).
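The pairing step of one round might be sketched as follows. This is a deliberately simplified illustration, assuming that scripts are simply sorted by their current estimated score and paired with their nearest neighbour. A working system would presumably also avoid repeating earlier pairs and handle the first rounds, when no estimates exist yet, differently. The function name `pair_by_similar_score` and the script identifiers are hypothetical.

```python
def pair_by_similar_score(scores):
    """Pair each script with a neighbour of similar estimated score.

    scores maps a script identifier to its current estimated score.
    Returns a list of (script, script) pairs; with an odd number of
    scripts, the highest-scoring one sits out this round.
    """
    ordered = sorted(scores, key=scores.get)
    return [(ordered[k], ordered[k + 1])
            for k in range(0, len(ordered) - 1, 2)]

# Made-up estimates after an earlier round.
current = {"s1": -1.2, "s2": 0.4, "s3": -0.9, "s4": 0.6}
print(pair_by_similar_score(current))
# [('s1', 's3'), ('s2', 's4')]
```

After the judges have decided each pair, the scores would be re-estimated from all judgements made so far and the pairing repeated for the next round.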

As with computer-adaptive testing, this adaptivity maximises the efficiency of the estimation procedure, increasing the separation of the scores and reducing the standard errors. The most obvious advantage is that this produces significantly enhanced reliability, compared to assessment by marking, with no loss of validity.
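The gain from pairing similar scripts can be made concrete under the log-odds model given earlier. If $p$ is the probability that A beats B, a single judgement carries statistical (Fisher) information about the difference $v_a - v_b$ equal to $p(1-p)$:

$$ p = \frac{1}{1 + e^{-(v_a - v_b)}}, \qquad I(v_a - v_b) = p\,(1 - p) \le \tfrac{1}{4} $$

The maximum of 0.25 is reached when the two scripts are of equal quality ($p = 0.5$). For a difference of two units, $p \approx 0.881$ and the information falls to about 0.105, so such a judgement contributes less than half as much as one between evenly matched scripts. This is a standard property of the logistic model and is offered here as an illustration rather than as the derivation used in the cited sources.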

However, it is not certain that the adaptivity genuinely increases reliability (Bramley & Vitello, 2016).

RM Compare
RM Compare is the original adaptive comparative judgement system. It was originally developed as CompareAssess by the company Digital Assess, formerly TAG Developments. The system is designed to run large-scale deployments of adaptive comparative judgement and has been used around the world in a wide range of contexts.

Comparative Judgement
No More Marking have created an online Comparative Judgement application, along with a repository of useful information.

e-scape
The first application of Comparative Judgement to the direct assessment of students was in a project called e-scape, led by Prof. Richard Kimbell of London University's Goldsmiths College (Kimbell & Pollitt, 2008). The development work was carried out in collaboration with a number of awarding bodies in a Design & Technology course. Kimbell's team developed a sophisticated and authentic project in which students were required to develop, as far as a prototype, an object such as a children's pill dispenser in two three-hour supervised sessions.

The web-based judgement system was designed by Karim Derrick and Declan Lynch from TAG Developments, now a part of Digital Assess, and was based on the original MAPS assessment portfolio system, now known as Manage. Goldsmiths, TAG Developments and Pollitt ran three trials, increasing the sample size from 20 to 249 students, and developing both the judging system and the assessment system. There were three pilots, involving Geography and Science as well as the original in Design & Technology.

Primary school writing
In late 2009, TAG Developments and Pollitt trialled a new version of the system for assessing writing. A total of 1000 primary school scripts were evaluated by a team of 54 judges in a simulated national assessment context. The reliability of the resulting scores after each script had been judged 16 times was 0.96, considerably higher than in any other reported study of similar writing assessment. Further development of the system has shown that reliability of 0.93 can be reached after about 9 judgements of each script, when the system is no more expensive than single marking but still much more reliable.
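The workload implied by these figures can be estimated roughly, assuming that each judgement counts once towards each of the two scripts in the pair and that the work was shared evenly among the judges (the totals below are derived here, not reported in the study):

$$ \frac{1000 \times 16}{2} = 8000 \ \text{judgements}, \qquad \frac{8000}{54} \approx 148 \ \text{judgements per judge} $$

At 9 judgements per script, the same calculation gives

$$ \frac{1000 \times 9}{2} = 4500 \ \text{judgements}, \qquad \frac{4500}{54} \approx 83 \ \text{judgements per judge} $$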

Further projects
Several projects are underway at present, in England, Scotland, Ireland, Israel, Singapore and Australia. They range from primary school to university in context, and include both formative and summative assessment, from writing to mathematics. The basic web system is now available on a commercial basis from TAG Assessment (http://www.tagassessment.com), and can be modified to suit specific needs.

ACJ has been used by Seery, Canty, Gordon and Lane at the University of Limerick, Ireland, to assess undergraduate student work on Initial Teacher Education programmes since 2009. ACJ has also been used by Dr. Bartholomew at Purdue University to assess the design portfolios of middle school, high school, and university students. Bartholomew has also used ACJ as a formative assessment teaching and learning tool for open-ended problems.