“A Real-World Test of Artificial Intelligence Infiltration of a University Examinations System: A ‘Turing Test’ Case Study”, Peter Scarfe, Kelly Watcham, Alasdair Clarke, Etienne Roesch (2024-06-26):

[OSF] The recent rise in artificial intelligence systems, such as ChatGPT, poses a fundamental problem for the educational sector. In universities and schools, many forms of assessment, such as coursework, are completed without invigilation. Students could therefore hand in work as their own which was in fact completed by AI. Since the COVID pandemic, the sector has additionally accelerated its reliance on unsupervised ‘take home exams’. If students cheat using AI and this goes undetected, the integrity of the way in which students are assessed is threatened.

We report a rigorous, blind study in which we injected 100% GPT-4 written submissions into the examinations system in 5 undergraduate modules, across all years of study, for a BSc degree in Psychology in the School of Psychology and Clinical Language Sciences (henceforth, ‘the School’) at the University of Reading (henceforth, ‘the University’). Markers of the exams were completely unaware of this.

We found that 94% of our AI submissions were undetected.

The grades awarded to our AI submissions were on average half a grade boundary higher than those achieved by real students. Across modules, there was an 83.4% chance that the AI submissions on a module would outperform a random selection of the same number of real student submissions.
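The 83.4% figure can be read as a resampling-style comparison: the probability that the AI submissions on a module beat an equally sized random draw of real student submissions. A minimal sketch of one way such a probability could be estimated is below; the grades are purely illustrative (not the study's data), and the paper's exact procedure may differ:

```python
import random

random.seed(0)

# Hypothetical grades on one module -- NOT the study's data.
ai_grades = [68, 65, 70, 66, 72]
student_grades = [55, 72, 60, 68, 58, 75, 62, 64, 59, 70]

def prob_ai_outperforms(ai, students, trials=100_000):
    """Estimate the chance that the AI submissions' mean grade exceeds
    the mean of a random sample of the same number of student grades."""
    ai_mean = sum(ai) / len(ai)
    wins = 0
    for _ in range(trials):
        sample = random.sample(students, len(ai))
        if ai_mean > sum(sample) / len(sample):
            wins += 1
    return wins / trials

p = prob_ai_outperforms(ai_grades, student_grades)
print(f"P(AI outperforms a random student sample): {p:.3f}")
```

With these made-up numbers the AI's mean grade beats most, but not all, random student samples; the published 83.4% would have been computed from the study's actual module-level grade distributions.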

…By design, markers on the modules we tested were completely unaware of the project. Other than those involved in authorizing the study and the authors, only a handful of others were aware (e.g. those who helped arrange paid marking cover for the additional AI submissions and those who created the special university student accounts needed for AI submissions). Study authorization did not require informed consent from markers. Following the analysis of the data, we invited all markers to two sessions chaired by our Head of School, to explain the study and gather feedback. Markers were very supportive and engaged in fruitful discussions. None had been aware that the study was being run.

What were markers told about AI?

At the time of running the study, in the summer of 2023, the use of AI to complete exams was not allowed and fell under the standard University academic misconduct policy, which stated that the work submitted by students had to be their own. The software systems used for exam submission and grading did not have an “AI detector” component. Colleagues received standard guidance from the School about how to spot poor academic practice and academic misconduct. This included: (1) checking if answers sounded “too good to be true”, e.g. a writing style, level of content, or quality not expected from an undergraduate student completing a timed exam paper; (2) spotting answers which covered predominantly content which was not taught on the module; and (3) checking for citations to references that did not support the claims being made in the answer. Many of these are characteristics of AI-written text.

At the time, AI (particularly ChatGPT) was in the news media daily and an active topic of conversation amongst colleagues doing exam marking. The problem posed by AI for the academic integrity of assessments had also been discussed in larger meetings in the School. In debrief sessions given to colleagues who had marked on modules where we submitted AI (after the study had finished), virtually all were aware of the threat of AI to the integrity of exams. Indeed, on the few occasions when academic misconduct was suspected and reported, some colleagues referred to suspicions related to AI, e.g. answers that seemed too “good to be true”, cited esoteric literature not covered in the course, or cited seemingly non-existent references. Some had also run exam questions through ChatGPT to compare with the suspicious answers and/or run suspicious answers through online “AI detectors”. Note that AI-detector output was not used in a diagnostic fashion for determining academic misconduct.

…Therefore, pragmatically, it seems very likely that our markers graded, and did not detect, answers that students had produced using AI, in addition to our 100% AI-generated answers. To counter this claim, one could argue that our AI submissions consistently outperformed real students, which might suggest that AI was not used widely, else we would not have found this. An alternative argument is that students used AI, but in modifying the AI-generated answers themselves made them worse than if they had simply entered the question and used the output unmodified, as we did. Those student submissions which outperformed AI could also have consisted of AI-generated material which students evaluated and verified, correcting factually incorrect information, to improve the AI response. Whatever the true prevalence of AI use, it is clear AI poses a serious threat to academic integrity.