Unscrambling the Mysteries of Cut Scores

By Rebecca Thomas
Forum Volume 17, Number 4

Want to test your understanding of the variances and vagaries surrounding exam cut scores? Consider the following fictional scenario, presented by Dr. Elizabeth Witt, a psychometrician with CAT*ASI, at the FSBPT 2002 Spring Education meeting in Orlando. Building on the scenario, Dr. Witt posed six problematic test development and scoring situations that might arise from it. Test yourself: How would you approach each situation? Then take a look at the suggestions offered by Dr. Witt.*

THE SCENARIO

The Disney Board of Directors offers a certification examination delivered on computer. All Disney “cast members” and certain other employees must maintain certification, which is granted for 10 years. The same examination is to be used for recertification.

For the first eight years, the Board developed the examination internally and contracted with Computer Assessment & Training (CAT – because who would know more about a Mouse?) for delivery. The passing score for the 40-item test was 70% during the first eight years. The percentage of candidates passing on their first attempt has ranged from 78% to 98%. Candidates are allowed to retest as necessary, and the exam is offered quarterly throughout the year. The test form is changed quarterly.

After developing the examination internally for eight years, the Board has contracted with Snow White and Associates for test development and item banking.

SITUATION 1

Snow White and Associates is recommending a tedious and elaborate standard setting procedure wherein each question is rated individually by a panel of certified Disney experts. This exercise, known as the “Angoff Cut Score Study,” would be costly to the Board in terms of travel, meals and vendor staff time. The Board would prefer to maintain the standard of 70% without paying for an extensive psychometric process.

Dr. Witt Comments

“Exam forms testing the same material can vary greatly. No matter how hard we try to make sure test forms are equivalent in difficulty, they are always slightly different. For that reason, we don’t want to arbitrarily set a passing score at, say, 70%, because 70% on an easy form is not at all the same thing as 70% on a difficult form. Schoolteachers often use 70% as a passing mark, but 70% could be a very high cut score or a very low one, depending on the difficulty of the exam.”

“To ensure greater fairness in test development and scoring, the American Psychological Association (APA), the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME) developed standards for testing. Basically, those standards, which are periodically revised, say that setting an arbitrary passing score is not a good idea. The problem is that if you can’t guarantee that different forms of the test offer the same content at the same level of difficulty, then you really can’t guarantee fairness for the candidates.”
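For readers curious about the arithmetic behind the procedure recommended to the Board, here is a minimal sketch, in Python, of how Angoff ratings might roll up into a cut score. Each judge estimates the probability that a minimally competent candidate would answer each item correctly; the ratings and panel size below are invented for illustration and are not from the scenario.

    # Hypothetical Angoff ratings: rows are judges, columns are items.
    # Each value is a judge's estimate of the probability that a minimally
    # competent candidate answers that item correctly.
    ratings = [
        [0.80, 0.70, 0.90, 0.60],
        [0.75, 0.65, 0.85, 0.70],
        [0.85, 0.60, 0.95, 0.65],
    ]

    num_judges = len(ratings)
    num_items = len(ratings[0])

    # Average the judges' estimates for each item.
    item_means = [sum(judge[i] for judge in ratings) / num_judges
                  for i in range(num_items)]

    # The raw cut score is the expected number of items a minimally
    # competent candidate answers correctly; divide by the item count
    # for a percentage cut.
    raw_cut = sum(item_means)
    pct_cut = 100.0 * raw_cut / num_items

    print(f"Raw cut: {raw_cut:.2f} of {num_items} items ({pct_cut:.1f}%)")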

SITUATION 2

The Board agreed to conduct an Angoff Cut Score Study following the assembly of an examination form. The average rating across the 10 experts was 80%. Using this standard would lead to a passing percentage of 50%, causing a shortage of new “cast members” in the theme parks. The Board suggests using the time-honored standard of 70% instead of the results of the Angoff exercise.

Dr. Witt Comments

“We should consider whether we’ll be short on people if too many fail a test. But you might apply different criteria, based on whether you are licensing or certifying. Licensure exams are criterion-referenced tests. Although certification tests are criterion-referenced too, there’s more ‘wiggle room’ about exactly where you set that passing standard. In this situation, we might want to adjust the cut score away from what the committee recommended.

“Keep in mind that a committee of experts should include representatives from all subspecialties within the field and all relevant geographic regions. Sometimes this doesn’t happen because, for instance, someone drops out at the last minute. Also, by sheer accident, a group of subject matter experts (SMEs) might end up being too harsh or too lenient. Given such variables, allow for flexibility in the original standard setting for the exam.

“Once the SMEs set the cut score, we ask them to predict the pass rate. If it’s not reasonable, they have a chance to say so at that time. The governing board then usually looks at the cut score study, reviewing the projected pass rates to determine if they are reasonable. If the board does not find the cut score acceptable, typically we adjust the cut score up or down within one standard error of measurement.

“Finally, we analyze the cut score study, incorporating two standard errors of measurement below the recommended score and two above – and we then determine the point within that range at which to set the cut score. This approach adjusts for human error.
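The adjustment Dr. Witt describes rests on the classical standard error of measurement, which can be computed from the spread of scores on a form and its reliability. A minimal sketch follows; the reliability estimate and score standard deviation are invented values, not figures from the scenario.

    import math

    # Hypothetical values; in practice the reliability estimate (e.g.,
    # KR-20) and the score standard deviation come from the exam's own
    # item analysis.
    score_sd = 4.0       # standard deviation of raw scores on the 40-item form
    reliability = 0.85   # internal-consistency reliability estimate
    angoff_cut = 32.0    # the committee's cut: 80% of 40 items

    # Classical standard error of measurement: SD * sqrt(1 - reliability).
    sem = score_sd * math.sqrt(1.0 - reliability)

    # Bands within which a board might defensibly adjust the cut score.
    print(f"SEM = {sem:.2f} raw-score points")
    print(f"±1 SEM: {angoff_cut - sem:.1f} to {angoff_cut + sem:.1f}")
    print(f"±2 SEM: {angoff_cut - 2*sem:.1f} to {angoff_cut + 2*sem:.1f}")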

“Typically, we only do this the first time a passing standard is set. We then have a good idea of where that standard is. Subsequently, if we have enough testing volume, we don’t conduct a separate cut score study for every exam. Instead, we equate each new exam back to the old exam, ensuring that the forms are equivalent in difficulty. So, essentially, we have the same passing standard for every exam form, and we require exactly the same knowledge and skill level to pass, regardless of how easy or difficult the form is.”
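Equating can be carried out in several ways, and the article does not say which method the vendor uses. The sketch below shows one simple approach, mean-sigma linear equating, with invented form statistics; it finds the raw score on a new form that corresponds to the passing standard set on a base form.

    def to_base_scale(x, mean_new, sd_new, mean_base, sd_base):
        """Map raw score x on the new form onto the base form's scale."""
        return mean_base + (sd_base / sd_new) * (x - mean_new)

    # Hypothetical form statistics for illustration only.
    mean_new, sd_new = 26.0, 5.0     # new, slightly harder form
    mean_base, sd_base = 28.0, 4.5   # base form on which the cut was set

    base_cut = 23.2  # e.g., a 58% standard on a 40-item form

    # Invert the mapping to find the new form's equivalent raw cut score.
    new_cut = mean_new + (sd_new / sd_base) * (base_cut - mean_base)
    print(f"Equivalent cut on the new form: {new_cut:.1f} raw points")

    # Sanity check: equating that score back recovers the base-form cut.
    assert abs(to_base_scale(new_cut, mean_new, sd_new,
                             mean_base, sd_base) - base_cut) < 1e-9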

SITUATION 3

The Disney Board of Directors has now used a criterion-referenced standard for two years. The standard was set at 58% and maintained through statistical equating for the next seven quarterly administrations. The percentage of candidates passing the exam ranges from 80% to 98%, but on the eighth administration, the passing percentage drops to an unexpected 65%. The Board asks Snow White and Associates to change the standard for that administration rather than using the statistical equating, so that a more acceptable percentage of the candidates will pass.

Dr. Witt Comments

“You have to find out why the sudden change occurred. The first thing to do is check for mistakes. Find out if an error occurred in the scoring or the equating process. Look for peculiarities in the population of candidates on that test day. Perhaps the test date itself was a factor (such as falling on or near a holiday).

“Review the content of the test items. Was the form built according to the specifications, or did somebody suddenly decide to try putting more in a certain area (which we are not supposed to do)? To maintain the equivalency of the exam across the various forms, you follow the same content outline every time.

“Be careful about chasing a particular pass rate. Once you have established where you want your criterion to be, and have evaluated its reasonableness in light of pass rates at that time, you can’t just start adjusting your standard.

“All kinds of things may happen for reasons we just can’t figure out. Maintain the standard at the same level and look for trends. If you identify a trend, you likely will find the cause. But most of the time, an unexpected pass rate turns out to be a fluke: for some reason, the particular group of people taking the exam this time wasn’t as capable as the group taking it the next time. And pretty soon the pass rates are back up where they belong.”

SITUATION 4

Certificates from the initial group of certified cast members will expire this year. The Board long ago decided that recertification candidates would take the same examination as the certification candidates. Some of the original certificants have already taken the exam, and statistical analyses show that the first-time certification candidates score higher than the recertification candidates do. The recertification candidates score higher on the few questions targeting the higher order cognitive abilities (synthesis, analysis). However, first-time certification candidates score higher on the majority of the questions, which target the lower cognitive ability of recall.

The Board suggests that a less stringent standard be used for the recertification. Snow White and Associates maintains that if the same test is used, then the same standard must be used for both groups of candidates.

Dr. Witt Comments

“You have to ask yourself why you’re recertifying people and what recertification is supposed to say. Maybe the two groups should not be taking the same test, especially if each group requires a different level of knowledge or skill. But it certainly seems silly to require less knowledge, which is the way it would look if you offered a lower standard on an exam for recertification candidates.

“The recertification candidates are actually scoring higher on items dealing with higher level cognitive processes. They probably just don’t recall some of the facts and details memorized by entry-level candidates.

“Possible solutions might include:

  • Offering a refresher course or providing study aids.
  • Creating a separate test for recertification.
  • Weighting the items differently, putting more emphasis on higher cognitive thinking items (a scoring sketch follows this list).
  • Considering something other than the exam for recertification, such as a workplace or “on-site” evaluation.
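As a rough illustration of the weighting option above, the following sketch scores a short answer record with hypothetical weights that count higher-order items double; the weights, item labels and responses are all invented.

    # Minimal sketch of differential item weighting (hypothetical data).
    # Higher-order items count more toward the total than recall items.
    item_weights = {"recall": 1.0, "analysis": 2.0, "synthesis": 2.0}

    # Each item: (cognitive_level, candidate_answered_correctly)
    responses = [
        ("recall", True), ("recall", False), ("analysis", True),
        ("synthesis", True), ("recall", True), ("analysis", False),
    ]

    earned = sum(item_weights[level] for level, right in responses if right)
    possible = sum(item_weights[level] for level, _ in responses)

    print(f"Weighted score: {earned:.1f} / {possible:.1f} "
          f"({100.0 * earned / possible:.0f}%)")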

SITUATION 5

On the tenth administration of the examination, everyone passes (100% passing percentage). The new CEO of Disney asks Snow White and Associates to move the passing score so that at least one candidate fails. The CEO worries that if everyone passes, people will perceive that the test is not sufficiently rigorous.

Dr. Witt Comments

“Possible explanations for this situation might include:

  • Error in scoring or equating
  • Change in exam content from the blueprints
  • Very small sample
  • Change in the test population
  • Environmental influences (in one known case, a television special gave candidates more knowledge)
  • Cheating

SITUATION 6

The new CEO attends a national certification conference and hears rumors that the Disney certification examination is biased. He asks Snow White and Associates if there is a way to show that the test is not biased in favor of native Californians. The vendor runs a “DIF analysis” that shows native Californians score higher than other test takers on several test questions.

Dr. Witt Comments

“DIF stands for ‘Differential Item Functioning.’ With DIF, we group people according to their overall level of ability, using their total score on the test. Once we’ve grouped people by ability, we look at whether they have a different probability of answering a question correctly. If some people are more likely to answer correctly because they are from California, for instance, then we say that test item has ‘DIF.’ Essentially, DIF is a statistical test for evaluating test content.

“Differential item functioning serves as a flag for further investigation of a particular test item. Once we flag a test item, we want to find out why the group difference occurs. If the reason people vary in their ability to answer the item correctly is related to the skill and knowledge you are trying to test, then the question is not biased. What you then have is two groups of people who differ in their amount of knowledge in a certain area. However, if the difference appears to be due to something completely irrelevant to what you are trying to test, then you would say the item is biased. In that case, you should remove or revise the question.
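The article names only a “DIF analysis”; one standard screening statistic is the Mantel-Haenszel odds ratio, sketched below with made-up response records. Candidates are stratified by total score, which is the ability matching Dr. Witt describes, and the odds of answering the item correctly are then compared between the two groups within each stratum.

    import math
    from collections import defaultdict

    # Made-up records: (group, total_score, item_correct).
    # "ref" and "focal" are hypothetical labels for the compared groups.
    responses = [
        ("ref", 35, 1), ("ref", 35, 1), ("ref", 28, 1), ("ref", 28, 0),
        ("focal", 35, 1), ("focal", 35, 0), ("focal", 28, 1), ("focal", 28, 0),
    ]

    # Stratify by total score so we compare candidates of equal ability.
    strata = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})  # [right, wrong]
    for group, score, correct in responses:
        strata[score][group][0 if correct else 1] += 1

    # Mantel-Haenszel common odds ratio across the score strata.
    num = den = 0.0
    for cells in strata.values():
        a, b = cells["ref"]     # reference group: right, wrong
        c, d = cells["focal"]   # focal group: right, wrong
        n = a + b + c + d
        num += a * d / n
        den += b * c / n

    alpha_mh = num / den                    # near 1.0 suggests no DIF
    delta_mh = -2.35 * math.log(alpha_mh)   # ETS delta scale for flagging

    print(f"MH odds ratio: {alpha_mh:.2f}, ETS delta: {delta_mh:.2f}")

Under a common ETS convention, items whose delta reaches roughly 1.5 in absolute value (with a statistically significant odds ratio) are flagged for the kind of content review described below.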

“Here’s an example from a fictitious engineering aptitude test: Pretend you have a bag beside your desk containing everything you need to build a gun. In 15 minutes, a ravenous Bengal tiger will be released into the room. You are to take all appropriate action.

“This item may, in fact, show DIF for gender. Men generally do better with it than women. Reasons might include previous knowledge of guns or physical strength to assemble a gun.

“If, as in this example, we find that the difference in performance on this test item occurs for reasons unrelated to engineering aptitude, we must re-evaluate the question. Often, subject matter experts need to be called in to wrestle with the question content.

“Finally, always beware of false positives and negatives when reviewing DIF results. DIF flags, although not foolproof, are a good way to identify items that are potentially biased.”

* Dr. Witt thanks Lynn Webb, a psychometrician from Chicago, who wrote the original version of this case study.