Claims that artificial intelligence (AI) systems have passed some licensure exams have raised concerns about exam validity and test security. However, doubts remain over how AI systems were used in these exams and the extent to which researchers had to use judgment to identify correct versus incorrect responses. This article is by Lorin Mueller, FSBPT Managing Director of Assessment.
Recent news reports have promoted the idea that ChatGPT and its successor, GPT-4, have been shown to pass some licensure exams. For example, a recent paper claimed that GPT-4 is able to pass the Multistate Bar Examination (MBE). Other researchers are examining AI's ability to pass other licensing exams, and the GPT-4 technical report shows that the latest version does very well across a range of standardized tests.
That is a disturbing claim to those concerned with exam validity and test security, including FSBPT. Could someone pass the NPTE by relying on generative text AI to determine the correct answers? And did these systems really pass the exams in the first place? There is a lot to be skeptical about in some of these recent claims.
The short answer is that we don't know. OpenAI did not design ChatGPT and GPT-4 to take these tests. These systems do not simply ingest a multiple-choice question and return a letter answer; you have to take additional steps to get them to generate one, and it's unclear how much the researchers had to manipulate the questions to get the interface to recognize them and produce a response. Beyond that, we do not know to what extent the researchers had to use judgment to identify correct versus incorrect responses. In fact, the aforementioned GPT-4 technical report states that letter responses are "synthesized" from textual responses, but it doesn't go a step further and report the extent to which different members of the research team agreed on the synthesis, or whether the researchers were blind to the key. That is not to question the integrity of these researchers, but to say that their focus is to demonstrate the maximal capabilities of these tools, rather than to test them against a standardized criterion using a pre-defined process (such as a standardized, proctored, and timed test administration).
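To make that synthesis step concrete, here is a minimal sketch of what mapping a free-text model response to a multiple-choice letter might look like. The heuristics below are assumptions for illustration, not the procedure OpenAI actually used; note how quickly an ambiguous response falls through to exactly the kind of human judgment the technical report leaves unreported.

```python
import re

def synthesize_letter(response_text: str, options: dict[str, str]) -> str | None:
    """Map a free-text model response to a multiple-choice letter.

    A hypothetical heuristic: look for an explicit letter first, then
    fall back to matching option text. Ambiguous responses return None
    and would require a human rater to adjudicate.
    """
    # 1. Explicit letter, e.g. "The answer is (B)" or "Answer: B."
    m = re.search(r"\b(?:answer|option|choice)\b[^A-D]{0,10}([A-D])\b",
                  response_text, flags=re.IGNORECASE)
    if m:
        return m.group(1).upper()

    # 2. Fall back: does the response quote exactly one option's text?
    matches = [letter for letter, text in options.items()
               if text.lower() in response_text.lower()]
    return matches[0] if len(matches) == 1 else None  # else: human judgment
```

Whenever the second branch or the None case fires, a person decides what the model "meant," which is why inter-rater agreement and blinding to the key matter for interpreting the reported scores.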
How, then, did these systems score so well? The answer isn't entirely clear either, but drilling down into the research makes the interpretation of these results a lot less earth-shaking.
First, high scores should not be surprising for norm-referenced general educational exams like the LSAT and SAT, which largely test verbal, logical, and computational reasoning. That’s what these AI systems are designed to do.
Second, while the results for the licensing exams are more concerning, given that these programs presumably didn't take formal professional education courses, it's important to note that these performances were based on publicly available practice exams. Although the researchers report using a "holdout" method, in which the system is theoretically blind to the original work (i.e., the test question) during training, it's unclear to what extent the GPT-4 system was exposed to these items through other sources. Because these are publicly available practice items, many are likely discussed in web forums or other derivatives of the original work. If someone took a PEAT item and posted a discussion about it on social media, for example, that discussion might have been among the training materials GPT-4 was exposed to.
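A toy example illustrates why a strict holdout is hard to guarantee for public items: even a forum post that merely paraphrases a practice question tends to share long word sequences with the original. The function and the n-gram size below are illustrative assumptions, not a description of OpenAI's actual contamination checks.

```python
def ngram_overlap(item_text: str, web_text: str, n: int = 8) -> float:
    """Fraction of the item's word n-grams that also appear in the web text.

    A toy contamination check: high overlap suggests the "held out" item
    may have leaked into training data via discussion threads or other
    derivative materials. The n-gram size of 8 is an arbitrary choice.
    """
    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    item_grams = ngrams(item_text)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(web_text)) / len(item_grams)
```

Even a check like this only catches near-verbatim leakage; looser paraphrases of an item can slip through while still teaching the model the answer.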
Third, and relatedly, many of these questions aren't the latest questions on the licensing exams, meaning some might be outdated; that exacerbates the training-exposure issue raised above. GPT-4 likely doesn't know whether it is answering a question that is current or one that is outdated and no longer part of the licensing exam.
At this point, we don't know that systems like GPT-4 can pass secure licensing exams. But even presuming they can, that ability would not undercut the validity of using these exams to assess human candidates for licensure.
First, we still need some way to assess whether humans possess the minimum knowledge and skills, and our licensure exams are well-suited to that task. Second, despite bold claims that these systems are capable of reproducing complex reasoning, we haven't seen many examples of them extending reasoning to novel tasks so much as reproducing likely responses based on what they have seen and read. Finally, licensing exams are just one step in the process of assessing whether someone has demonstrated the minimal competence required to work in a licensed profession. Other requirements typically include successfully completing a formal education program; demonstrating competence in an internship, practicum, or residency; and demonstrating sufficient ethical and moral character.
It is unlikely that someone who doesn’t know much about physical therapy could rely on GPT-4 to pass a current version of the NPTE. However, someone who knows some physical therapy might be able to use a generative text tool to help them answer some questions they’re not sure about. That said, even if someone could use one of these tools to help them on the NPTE, it’s not feasible that they would be able to do so during a secure in-person administration, which is required for all NPTE exam administrations.
What is FSBPT doing about these tools? We're working on it. We could certainly adapt these tools to identify test questions that are easily guessable from semantic or grammatical cues, as the sketch below illustrates. And if using GPT-like tools during in-person administrations ever becomes more feasible, we could identify the response patterns of candidates who rely on these tools and invalidate their scores.
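As one hedged example of screening for surface cues, consider the sketch below. It uses a single classic testwiseness heuristic, that item writers tend to make the keyed option the longest and most qualified one; the item format and field names are hypothetical, and this is not FSBPT's actual screening procedure.

```python
def flag_guessable_items(items: list[dict]) -> list[str]:
    """Flag items whose key might be guessed from surface cues alone.

    Illustrative cue: the keyed option is the longest one. Items this
    heuristic "solves" without any content knowledge are candidates
    for revision. The item structure (id, options, key) is hypothetical.
    """
    flagged = []
    for item in items:
        longest = max(item["options"], key=lambda k: len(item["options"][k]))
        if longest == item["key"]:
            flagged.append(item["id"])
    return flagged
```

A fuller screen might combine several such cues, or simply ask a language model to answer each item and flag those it gets right too easily.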
It may also be possible to use generative text AI tools to assist in item writing so that testing programs can produce more test questions. The manual process for creating new test questions is quite extensive. If AI could make this process less laborious, each exam administration could potentially be unique. While we already have many safeguards in place to reduce any incentive to memorize, harvest, and post items, this would reduce those incentives even further.
In some sense, that future is already here. Some programs are already using AI to generate unique questions based on item models. Item writers identify common types of questions that follow a specific pattern and have a range of variables and outcomes. For example, an NPTE item could be based on a model where a patient reports pain in their hip. Variables could include the age of the patient, their activity level, movements that increase or decrease the pain, and any other conditions the patient might have. These combinations of variables might lead to different diagnoses, treatments, or the need for further tests. The AI process takes all these factors and uses text models to generate large numbers of possible items.
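As a rough illustration of how an item model expands into many candidate questions, consider the sketch below. The template, variable values, and output are invented for this example, loosely following the hip-pain scenario above; a production system would pair each stem with answer options and a key, and route everything through content-expert review.

```python
import itertools

# A hypothetical item model; the template and variables are illustrative only.
TEMPLATE = ("A {age}-year-old patient with a {activity} activity level reports "
            "hip pain that worsens with {movement}. Which diagnosis is MOST likely?")

VARIABLES = {
    "age": ["16", "45", "72"],
    "activity": ["low", "moderate", "high"],
    "movement": ["hip flexion", "weight bearing", "internal rotation"],
}

def generate_stems():
    """Expand the item model into every combination of its variables."""
    keys = list(VARIABLES)
    for values in itertools.product(*(VARIABLES[k] for k in keys)):
        yield TEMPLATE.format(**dict(zip(keys, values)))

for stem in generate_stems():
    print(stem)  # 27 candidate stems; each still needs options, a key, and review
```

Even this tiny model yields 27 distinct stems; richer models with more variables and AI-generated phrasing variants are how large, varied item pools become practical.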
As we continue to develop and use these tools, more applications may become apparent. FSBPT will continue to monitor both the challenges and the opportunities that AI tools present in assuring that candidates are qualified to serve as PTs and PTAs.
Lorin Mueller, PhD, joined the Federation of State Boards of Physical Therapy (FSBPT) in 2011 as its Managing Director of Assessment. Prior to joining FSBPT, Lorin spent ten years as Principal Research Scientist at the American Institutes for Research in Washington, DC. Lorin received his PhD in industrial and organizational psychology with a specialization in statistics and measurement in 2002 from the University of Houston.