### Ofqual's ‘mutant’ A-level algorithm: a post-mortem

#### 6th September 2020

Students in London protest the 2020 A Level results, which were awarded by an algorithm.

By Andrew Hillman

It is common for politicians to seek to sidestep responsibility when policy goes wrong. But the UK Government’s response to the A Level results U-turn was unusual because of how ministers sought to avoid blame: claiming to be powerless against an out-of-control computer program.

Prime Minister Boris Johnson said “grades were almost derailed by a mutant algorithm”. Even after the exam regulator Ofqual abandoned the approach, instead awarding grades purely based on assessments provided by teachers, Schools Minister Nick Gibb claimed that the method was fair – the algorithm was the problem. Speaking to BBC 5 Live, Gibb said:

“[The model] is a fair system in terms of delivering estimated grades. The issue that emerged on Thursday when the algorithm was published and we saw those tragic cases of young people getting their results was that there was something in the algorithm…The statistics authority are looking in detail at the algorithm to see why it did not deliver precisely what the model required it to do.”

At times, the Education Secretary Gavin Williamson even portrayed himself as the John Connor-styled hero in this Sci-Fi narrative, scuppering the algorithm’s malevolent plans in the nick of time. Williamson told Sky News that “it became apparent there were challenges within the algorithm when we were seeing results directly coming out” and that he decided to abandon Ofqual’s method once the scale of anomalous grades was identified. “When that was clear, we took action,” he said, adding that, “further action had to be taken; that’s what I did.”

The reality was very different. For one, Ofqual’s chair Roger Taylor later announced that it was the exam regulator’s decision to switch to teacher assessed grades, not Williamson’s.

But it was the depiction of the algorithm that was most fictitious. The algorithm bore little resemblance to artificial intelligence systems, like Terminator’s Skynet, that evolve beyond their creators’ control. Ofqual’s algorithm did not conceal its intentions until judgment day – Johnson, Gibb and Williamson could have inspected the results and identified problematic patterns long before students received their grades.

In fact, Ofqual’s algorithm did nothing of its own accord – it involved no machine learning or “mutation”. In other words, the algorithm’s failures were entirely predictable based on its design – a design dictated by the Government.

#### Attempt to standardise results leads to biased model

In March, Williamson wrote to Ofqual asking the exam regulator to develop a system for allocating grades to students in the absence of exams, which had been cancelled due to Covid-19. He instructed Ofqual to calculate students’ results “based on their exam centres’ judgement of their ability” but also said the distribution of grades should be similar to previous years.

When results were released, many students were shocked to discover that their grades fell far below those predicted by their teachers. Since teachers had, on average, predicted that their students would perform better than previous years’ cohorts, Ofqual decided to downgrade results. According to the regulator, 36% of results were one grade lower than the assessment provided by the school and 3.5% were two or more grades lower.

But Ofqual's methodology report, released alongside A-Level results, showed that downgrading was not applied uniformly: students in the smallest cohorts, which disproportionately included privately educated students, did not see their results adjusted at all.

#### Why would Ofqual design an algorithm that does not treat each student equally?

So why did Ofqual choose a bifurcated approach, where small and large cohorts were allocated grades using different criteria?

##### Alternative option 1 – teacher assessed grades

One alternative was to allocate all grades based solely on unadjusted and unregulated teacher assessments – the system that eventually replaced Ofqual’s algorithm at the eleventh hour. There were two problems with this approach.

Firstly, it would have created unfairness between graduating students. Some students would have received inflated grades because their teachers made more generous assessments, while other students would have been disadvantaged because their teachers modified their predictions to reflect the cohort’s historical results, as Ofqual had encouraged schools to do.

Secondly, putting complete faith in teacher predicted grades would have created unprecedented grade inflation. In 2019, 76% of A-Level grades were C or above and 8% were A*. Based on teachers’ assessments, 87% of 2020 grades would be C or better, including 14% at A*. Grade inflation could put students from other years at a disadvantage when applying for jobs, or undermine the legitimacy of 2020 results, leaving this year’s graduates without qualifications that employers trust.

Data Source: Ofqual, Joint Council for Qualifications

Note: Data is for A Levels in England only.

##### Alternative option 2 – all students’ results calculated by Ofqual’s predictive model

So, if relying entirely on teacher assessed grades was undesirable, what about the opposite approach, where even grades in small cohorts were calculated using Ofqual’s predictive model? The problem with this option was that small cohorts provide very weak predictive data, reducing the model’s accuracy.

For example, at Bradford Girls Grammar School, 62% of English Literature students received an A or B grade in 2018, compared with just 17% in 2017 and 32% in 2019. If you were asked to predict the distribution of grades for the 2020 class, you would face a dilemma: are they more like the lower-attainment classes in 2017 and 2019, or like the high-performing class in the middle year?

Data Source: Department for Education

The simple answer is that you cannot know for sure: the cohort is too small to produce a stable distribution of grades. In other words, it is not possible to separate the contribution of the school’s English Literature teaching from the year-to-year variation in cohort ability. In these circumstances, teacher assessed grades are likely to be more accurate than the predictive model, even if they are inflated.
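The instability can be made concrete with a quick calculation using the three years of A/B shares quoted above (a rough sketch, treating each year’s share as a single data point):

```python
import statistics

# Share of students awarded A or B in English Literature at
# Bradford Girls Grammar School, as quoted in the text.
ab_share = {2017: 0.17, 2018: 0.62, 2019: 0.32}

mean = statistics.mean(ab_share.values())
spread = statistics.stdev(ab_share.values())

print(f"mean A/B share: {mean:.0%}")              # 37%
print(f"year-to-year stdev: {spread:.0%} points")  # ~23 percentage points
```

A standard deviation of roughly 23 percentage points around a 37% mean means any single-year prediction for this cohort is little better than a guess.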

(Note: you might be wondering if the variation in grades for English Literature at Bradford Girls Grammar School is typical of small cohorts or if I have picked a wild outlier to prove my point. The visualisation below gives the data for 100 representative small cohorts – ordered from most to least stable year-to-year results. Click on a cohort to observe the awarded grades.)

Data Source: Department for Education

#### What were the problems with Ofqual’s model?

It seems, then, that the mixed approach was an effective best-of-both-worlds method, using Ofqual’s predictive model where possible and falling back on teacher assessments where historical data was lacking. However, because this approach did not treat all students equally, it created the opportunity for systematic biases to sneak into the results.
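The cohort-size rule at the heart of this mixed approach – teacher grades for the smallest cohorts, the model for the largest, and a blend in between, using the thresholds reported in Ofqual’s methodology – can be sketched as a simple decision function (an illustrative sketch; the function name is mine, and the real blending was more nuanced):

```python
def grading_basis(cohort_size: int) -> str:
    """Which inputs determined a cohort's grades under Ofqual's
    approach, per the thresholds in its methodology report."""
    if cohort_size < 5:
        return "teacher assessed grades only"
    elif cohort_size <= 15:
        return "blend of teacher assessment and predictive model"
    else:
        return "predictive model only"

print(grading_basis(3))    # a tiny private-school class
print(grading_basis(10))   # a mid-sized cohort
print(grading_basis(100))  # a large sixth-form cohort
```

The bias follows directly from this branch: students routed down the first path kept their (on average inflated) teacher assessments, while those on the last path were standardised against their school’s history.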

##### Bias in favour of small cohorts

Because relatively few students study Music (5,000 students in 2020), cohorts are likely to be smaller than for more popular subjects like Maths (85,000 students in 2020). Therefore, Ofqual awarded a greater proportion of results based on teacher assessment in Music than in Maths, leading to greater grade inflation. Compared to previous years, the percentage of students awarded an A* or A in Music rose by 23 percentage points, whereas for Maths the share of students receiving top grades actually fell slightly.

Data Sources: Department for Education, Joint Council for Qualifications

Note: Small cohort % is based on estimates from 2017–2019 results data. Credit to the FFT Education Data Lab for first looking at small cohort % by subject.

Following the same logic, we would expect small schools, where a greater proportion of students are in small cohorts, to see bigger performance increases than larger schools. While Ofqual did not publish performance by school size, it did release performance data by school type. If cohort size were affecting results, we would expect greater grade inflation for private schools than for sixth form colleges, which typically have the largest cohorts.

Indeed, the data shows that private schools experienced a disproportionate increase in performance – the share of A*/A grades rose by 4.7 percentage points from 2019. In contrast, sixth form colleges saw a rise of just 0.3 percentage points. This was particularly problematic because privately educated students were already twice as likely to achieve an A* or A grade as those at sixth form colleges, so the systematic bias favoured the already privileged.

Data Source: Ofqual, Department for Education

Note: The school types used by Ofqual differ from those published by the Department for Education, so to estimate the small cohort % we mapped school types based on reasonable assumptions.

##### Bias against outlying students

For cohorts of over 15 students, where Ofqual’s predictive model was solely responsible for awarding grades, the system’s fairness depended on the year-on-year stability of results. But many medium-sized cohorts exhibited significant variation between years, especially for the highest and lowest attainers.

I spoke to Thomas, a student at Queen Elizabeth Sixth Form in Darlington. His teachers assessed him as an A* student in English Literature and ranked him 6th out of a cohort of around 100 students. In 2017 and 2019, six English Literature students at the school were awarded an A*, but in 2018 the cohort had zero students receiving the top grade. As a result, Ofqual’s predictive model allocated the cohort just four A*s in 2020, and so Thomas was downgraded to an A.
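The arithmetic behind Thomas’s downgrade can be reconstructed from the figures above (a crude sketch of the mechanism – the real model was more complex, but the averaging-and-ranking logic is the crux):

```python
# A* counts for English Literature at Thomas's school, as quoted
# in the text (2017, 2018, 2019).
historical_a_stars = [6, 0, 6]

# Sketch: allocate roughly the historical average number of A*s.
allocated_a_stars = round(sum(historical_a_stars) / len(historical_a_stars))

# Thomas was teacher-ranked 6th in a cohort of ~100, so he fell
# outside the four A* slots the model allocated.
thomas_rank = 6
awarded = "A*" if thomas_rank <= allocated_a_stars else "A"

print(allocated_a_stars, awarded)  # 4 A
```

A single anomalous year (2018’s zero A*s) was enough to pull the allocation from six down to four, costing two top students their A*s.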

Data Source: Department for Education

“Using schools’ past performances produces an outcome that looks the same for the school, but isn’t fair at an individual level,” Thomas said. “They could have put more thought into how it would affect individual students.”

To meet his offer from Cambridge University, Thomas needed to achieve an A* in either English Literature or German. Since Thomas’ cohort for German was just six students, his grade was mostly dependent on his teacher assessment, which was also an A*. But because he was ranked 2nd in the cohort and the school had not had an A* student in German in recent years, Thomas was downgraded to an A again.

Thomas was relieved by the government U-turn: “Centre assessed grades aren’t fair, but they’re fairer than what we got [from the algorithm],” he said. However, Thomas was also frustrated that the Government acted so late, after Cambridge University had already filled up spaces with students who were awarded better grades by the algorithm. “It should have been fixed and rectified before results day,” Thomas said. “We’re now left in an awful position fighting over the handful of spaces that are available this year.”

(Note: you can see how much year-to-year grade variation there is for medium-sized cohorts in the visualisation below. The visualisation gives the data for 100 representative medium-sized cohorts – ordered from most to least stable year-to-year results. Click on a cohort to observe the awarded grades.)

Data Source: Department for Education

The predictive model was even harsher on exceptional students at historically poorly performing schools. In medium-sized cohorts, students who were predicted A* and ranked 1st by their teachers could have their results downgraded if the school’s recent graduates had not consistently received A*s. As Richard Wilkinson, a Professor of Statistics at the University of Sheffield, tweeted, “This really hits social mobility. Essentially the algorithm regresses everyone to be more like the 'average' kid in their school. Which means if you're a bright kid in an underperforming school, you're particularly likely to suffer.”

Equally, if a year group did not have any struggling students, B or C grade students could end up at the bottom of the cohort’s ranking. If previous years’ students had been awarded E or U grades for that subject, the model could then predict that at least one 2020 student must be awarded an E or U. It is hard to understand why Ofqual’s algorithm was not modified to prevent this eventuality, for example, by capping downgrades at one or two grades.
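This failure mode – a B-grade student awarded a U purely because of the cohort’s history – falls straight out of rank-matching. A minimal sketch, with entirely hypothetical grade lists:

```python
# Sketch of rank-matching: the model fixes this year's grade
# distribution from previous years' results, then deals those grades
# out in teacher-provided rank order.
model_distribution = ["A", "B", "B", "C", "C", "E", "U"]  # from past cohorts
teacher_assessed   = ["A", "A", "B", "B", "C", "C", "B"]  # this year, rank order

awarded = list(zip(teacher_assessed, model_distribution))
for rank, (assessed, model_grade) in enumerate(awarded, start=1):
    print(f"rank {rank}: teacher said {assessed}, model awarded {model_grade}")

# The student ranked last was assessed as a B but awarded a U,
# solely because previous cohorts included a U.
```

A cap of one or two grades on downgrades, as suggested above, would amount to clamping each model grade to within that distance of the teacher assessment.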

#### What could Ofqual and the Department for Education have done differently?

##### Addressing individual unfairness

Where the ability of the 2020 cohort exceeded the performance of previous years, it is difficult to see how a predictive model could have prevented all unjust downgrading. As Guy Nason, Chair in Statistics at Imperial College London, explained, “One of the things going on here is that statistics is great at looking at overall populations and understanding overall behaviour. They're not always great at predicting at the individual level.”

But the government had other options. Sam Freedman, a former policy advisor for the Department for Education, argued that “pre-appeals” of A-Level results should have been permitted. This move, combined with greater transparency around how Ofqual's algorithm would work, could have allowed students who believed they would be unfairly penalised to pre-emptively challenge their grades.

More flexibility could also have been given to schools. If a school felt they had an exceptional student in a cohort without top grades in previous years, Ofqual could have allowed the school to “reallocate” an A* grade from another subject. Equally, the regulator could have permitted schools to swap groups of grades – for example, if the predictive model allocated a pair of students A grades, their school could have instead awarded one A* and the other a B if those matched their teacher assessed grades.

##### Addressing system unfairness

Early in the process, Ofqual should have recognised a fundamental dilemma: the regulator did not have enough data to accurately standardise small cohorts’ assessed grades, but if it did not standardise these assessments it would unfairly benefit particular groups of students, including the privately educated.

One solution was to collect more data. If Ofqual had asked teachers of small cohorts for granular grades (e.g. B+, B, B-, C+, C, C- options) and incorporated students’ individual GCSE results, this would have provided enough information to tentatively standardise small cohorts’ results.

Alternatively, Ofqual could have prioritised obtaining the most accurate teacher assessed grades possible from small schools. This could have been achieved by asking neighbouring small schools to oversee each other's assessments, requesting evidence justifying assessed grades where they differed from historical performance, or asking Ofsted to audit a sample of assessments from small schools.

#### “Still aspects of the algorithm we do not fully understand”

Possibly the biggest problem with Ofqual's approach was that by Thursday morning, as students were receiving their results and universities were using algorithm-determined grades to make life-altering decisions, nobody outside government knew how the system worked: how the algorithm assigned grades, how accurate it was, or whether the process was fair and error-free.

The Royal Statistical Society had been calling for better external oversight since April. In a letter to Ofqual, the society’s Vice President, Sharon Witherspoon, recommended two society fellows who could provide external guidance. This offer was later retracted when Ofqual requested that advising statisticians sign non-disclosure agreements that Witherspoon said would have prevented them from “commenting in any way on the final choice of model for some years after this year's results were released.”

Nason was one of those society fellows. He feels that publishing the methodological “direction of travel” prior to results day would have “permitted the community and interested parties to make suggestions or raise queries that might have been helpful.” He added that, “There are still aspects of the algorithm that we do not fully understand. It would be interesting to learn from precisely where they have obtained advice from in the development of this algorithm.”

In July, the Education Select Committee called for complete transparency and the immediate publication of the model to “allow time for scrutiny”. It also stated that Ofqual should explain how the model “ensures fairness for schools without three years of historic data, and for settings with small, variable cohorts.”

Most damning was the Royal Statistical Society's letter sent one day after results were released, calling for the Office for Statistics Regulation to review Ofqual's approach, in which the Society said:

“We do not believe that the development of the statistical adjustment methodology has been transparent enough to meet our concerns about statistical quality or the need for greater involvement of knowledgeable external experts. We are sure that it has not been sufficiently transparent to meet the aim of being trustworthy in the broader sense.”

#### How accurate was Ofqual’s predictive model?

Ofqual has now published its methodology, so we can evaluate the predictive model's accuracy. But first, ask what level of accuracy would be acceptable: if the model accurately predicted 90% of grades, would that be good enough? Even then, roughly 15% of students studying three A Levels would receive at least one undeservedly low grade (with another 15% getting at least one undeservedly high grade).

In actuality, when the model was tested by retroactively applying it to predict 2019 exam grades, its accuracy was low – 56% for Politics, 62% for Geography and 61% for Maths. In other words, the model had just a one-in-five chance of correctly predicting all three exam results for a typical student studying Politics, Geography and Maths.
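Both figures – the roughly 15% from a hypothetical 90%-accurate model, and the one-in-five chance here – follow from a few lines of arithmetic, assuming grading errors are independent across a student’s subjects:

```python
# Hypothetical 90% per-grade accuracy, with errors split evenly
# between undeservedly high and undeservedly low grades.
p_undeserved_low = 0.05
p_at_least_one_low = 1 - (1 - p_undeserved_low) ** 3
print(f"{p_at_least_one_low:.1%}")  # ~14%, the "roughly 15%" above

# Ofqual's reported back-test accuracy for three subjects.
accuracy = {"Politics": 0.56, "Geography": 0.62, "Maths": 0.61}
p_all_three_correct = 1.0
for subject_accuracy in accuracy.values():
    p_all_three_correct *= subject_accuracy
print(f"{p_all_three_correct:.1%}")  # ~21% – roughly one in five
```

Independence between subjects is itself a generous assumption: errors driven by a school’s historical data are likely correlated across a student’s subjects.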

But how did the retroactive accuracy tests work, given that teachers did not provide cohort rankings in 2019? As a proxy, Ofqual used students' actual exam results to form a cohort ranking. In other words, it assumed that teachers could perfectly predict the ordering of their students' exam results – a wholly unrealistic assumption. Nason described this process of feeding exam results into a model designed to predict those exam results as “not really good statistical practice”. It means that the model's average accuracy is even lower than the 27%–68% range.

Data Source: Ofqual

Note: This accuracy refers to the predictive model being used to allocate all grades. In reality, Ofqual did not use the model for cohorts of fewer than 5 students, and combined predictions with teacher assessments for cohorts of between 5 and 15 students.

Writing in The Guardian, Laura McInerney, co-founder of educational pollster Teacher Tapp, said, “No one was told how poor the guesses would be until Ofqual revealed its technical document on results day.” She added:

“There were plenty of alternatives. If the government had admitted the tentative nature of the results, and provided every student with the teacher predictions, rank, and their calculated grade, it would have been easier for universities to make reasoned judgments. If schools had received the results earlier, pre-appeals could have been made. More generosity could have been built into the system. None of this has happened.”

#### Individual outcomes and uncertainty are crucial characteristics of the problem, not products of a “mutant” algorithm

These measures would have made an algorithmic approach more tolerable. Ultimately, however, there is a strong argument that predictive modelling was a fundamentally ill-suited method for allocating grades, and that the Government should have had the foresight to reject it at the outset.

On Twitter, Hannah Fry, author of “Hello World: Being Human in the Age of Algorithms”, listed the elements that made Ofqual’s algorithm a “disaster waiting to happen”: “Impossible problem expecting a magical solution, over complicated solution, in-built bias, over-trust in equations, total lack of transparency, no easy way to appeal.” Tom Haines, a lecturer in machine learning at the University of Bath, described the process as “a terrible example of using artificial intelligence to make life altering decisions.”

Here is one way of thinking about it. Suppose a student was predicted ABB by their teachers. Imagine that if they had actually sat exams they would have achieved BBC, and that Ofqual's algorithm correctly predicts BBC. Even in this case, where the algorithm outperforms teachers, it has still failed the student. Why? Because individuals deserve to be in control of their own destiny. Receiving disappointing results based on an exam or a teacher assessment is very different from being allocated low grades simply because the student went to a large sixth form, or because two students in 2018 received Ds.

Ofqual’s chair Roger Taylor acknowledged this in a recent letter to the Education Select Committee, writing:

“While sound in principle, candidates who had reasonable expectations of achieving a grade were not willing to accept that they had been selected on the basis of teacher rankings and statistical predictions to receive a lower grade. To be told that you cannot progress as you wanted because you have been awarded a lower grade in this way was unacceptable and so the approach had to be withdrawn. We apologise for this. It caused distress to young people, problems for teachers, disrupted university admissions and left young people with qualifications in which confidence has been shaken.”

Taylor added that with hindsight, the “inherent limitations of the data” made delivering a fair algorithm for awarding grades an impossible task.

As shown above, it is easy to find specific examples where Ofqual’s algorithm produces judgments that are considerably less fair on individual students than if they received their teacher assessed grades. If ministers were focused on outcomes for individual students, they would have thought through these examples and recognised that a different approach was required. Williamson’s decision not to allow students a proper opportunity to appeal is further evidence that the strategy was not student focused.

Writing in the New Statesman, Lewis Goodall explained how reporting on the A Level U-turn helped him better understand the Windrush scandal:

“In spirit, if not in substance, this episode had many of its hallmarks: the same need to ‘standardise’, the same impersonal regard for circumstance, the same rigidity, the same fingerprints of alienation.”

Goodall could easily have been referring to the Department for Work and Pensions’ controversial work capability assessment or the inflexible bureaucracy that people need to navigate in applying for Universal Credit.

It is systems like these that make Johnson’s reference to a “mutant algorithm” and Nick Gibb’s claim that the problem was “something in the algorithm”, rather than the strategy outlined by ministers, so frustrating. The problems with Ofqual’s method were not unique consequences of using an algorithm, nor did they spontaneously arise at unfortunate moments. They were recurrent and predictable issues, arising directly from the creation of systems that rely on imperfect judgments without prioritising the most important question – what are the fairest and most humane outcomes possible for the individuals being judged?