Confidence testing—How to answer a meta-question
By Frank Davidoff, MD, FACP
He who knows and knows that he knows is conceited; avoid him.
He who knows not and knows not that he knows not is a fool; instruct him.
He who knows and knows not that he knows is asleep; awaken him.
But he who knows not and knows that he knows not is a wise man; follow him.
The venerable multiple-choice question (MCQ) is by far the commonest way to assess the state of someone's medical knowledge, largely because the vast experience with MCQs gives them the unchallenged edge over other psychometric techniques. They do their job so well, in fact, that the MCQ-based statistical view of the medical education landscape dominates our thinking about the evaluation of medical expertise, especially at the so-called "high stakes" level; even the most basic assumptions underlying their modus operandi are now accepted almost without question.
More generally, the great power of statistical thinking makes it easy to forget that the statistical lens can introduce significant distortions into the pictures it produces. As part of the statistical armamentarium, MCQ-based psychometric measurement also filters medical reality, and in ways that are of particular concern. First, MCQs, as they are now used, require answers to be categorized as either right or wrong. Much of medicine does indeed lend itself to this kind of black-and-white thinking. However, medical knowledge in many other areas is incomplete, ambiguous or conflicting, hence not amenable to such simplistic categorization. The ultimate result is that it is difficult or even impossible to write MCQs for these "fuzzier" items.
Second, and more to the point, MCQs come up short at a deeper level: the all-or-none "payoff" system associated with MCQs rewards guessing, which is why MCQs have acquired the alternative, and sardonic, name of "multiple-guess questions." A reward system that encourages guessing is no laughing matter, particularly since such a system rewards overconfidence, which is not necessarily in the best interests of physicians or patients.
Third, a "right-wrong" scoring system assumes the only knowledge that is worth anything (and given credit) is complete knowledge. Incomplete or partial knowledge--which, while useful, isn't sufficiently clear or complete to allow you to commit yourself unequivocally to a single answer--is considered worthless. "Right-or-wrong" scoring of MCQs thus penalizes the impulse to hedge, or, even worse, ignores and ultimately extinguishes hedging, even when hedging would more accurately reflect the state of your knowledge.
In its present form, therefore, MCQ testing fails to tap into valuable information about a learner's state of mind. In this instance, the information lost is not about factual knowledge per se, but rather about confidence in that knowledge. Diamond and Forrester have clearly framed the fundamental distinction between these two types of information (1). Knowledge, in their view, answers the question, "What do you know?" In contrast, confidence answers a "meta-question"--that is, a question about a question; in this case the meta-question is, "How sure are you of your answer to the question about what you know?"
In medical terms, it is the difference between a physician's statement that "There's a 60% chance this treatment will work for you" (knowledge) and the statement that "There's a 10% chance I know what I'm talking about" (confidence). Recognition of the difference has profound implications for medicine, since adjusting your level of confidence to match the state of your knowledge is critical in coping with the inevitable uncertainty of medicine. Unfortunately, making this adjustment is extremely difficult for most people, and physicians are no exception. Indeed, one thoughtful student of this problem has concluded that "Physicians ... will acknowledge medicine's uncertainty once its presence is forced into conscious awareness, yet at the same time will continue to conduct their practices as if uncertainty did not exist" (2).
Knowledge and confidence
Curiously, while MCQs reinforce overconfidence and, at the same time, fail altogether to assess confidence, the broader discipline of statistics explicitly recognizes both the difference between knowledge and confidence and the importance of that difference. It is, after all, biomedical statistics, that hardest and most quantitative of sciences, whose most critical quantitative measure of experimental results is the confidence interval (3). This term indicates quite clearly that statisticians understand that the main purpose of all their theory, their algebra and their hard numbers has less to do with changing people's factual knowledge than with adjusting their level of confidence in that (intrinsically uncertain) knowledge.
While the dominance of right-or-wrong answer MCQs reinforces the perception that confidence isn't worth measuring, can't be measured, or both, important work on confidence testing over the past 30 years suggests that neither of these inferences is justified (4). The basic theory of confidence testing is simple, but subtle. Instead of forcing people into making artificially confident yes-or-no choices among answers, only one of which is actually correct, confidence testing allows you to express your degree of confidence ("shades of gray") in all answers within a given set. Thus, if you are pretty sure that answer A is right, you might express this by assigning it a "probability" value of 0.7 (where zero means you are certain A is not the right answer, 1.0 means you are certain that it is). If you also think B might be correct but are a lot less sure, you might assign it a value of 0.3, while C, which you are almost but not quite sure isn't correct, might get a 0.1. A state of complete uncertainty about any of the answers would be expressed by choosing 0.5 for all three options.
This probabilistic approach is a step in the right direction, but by itself is not effective, since people soon learn that over many questions the payoff is greater if they place most or all of their confidence in one preferred answer, however limited their actual confidence in that answer might be--which in effect recreates the present "all-or-nothing" MCQ scoring system. While this appears to be a fairly intractable psychometric barrier, Brown and Shuford pointed out a way around it some years ago (5). Working, interestingly enough, in the area of government intelligence, they showed that people who choose from a special set of points linked to their degree of confidence in an answer will, over time, express their confidence more accurately than if they simply choose the probability that each answer is correct.
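The incentive problem described above can be illustrated with a small sketch. It assumes a simple "linear" payoff in which the points earned equal the probability placed on what turns out to be true; that payoff rule, and the names `expected_linear_payoff` and `best_report`, are illustrative assumptions and are not taken from the cited work.

```python
def expected_linear_payoff(true_prob, reported_prob):
    """Long-run average payoff under an assumed linear rule.

    true_prob: how often the answer is actually right (the honest belief).
    reported_prob: the probability the test taker writes down.
    Assumed payoff: reported_prob points if the answer is right,
    (1 - reported_prob) points if it is wrong.
    """
    return true_prob * reported_prob + (1 - true_prob) * (1 - reported_prob)


def best_report(true_prob, candidates):
    """Return the reported probability with the highest expected payoff."""
    return max(candidates, key=lambda r: expected_linear_payoff(true_prob, r))


# Reported probabilities from 0.0 to 1.0 in steps of 0.1.
candidates = [i / 10 for i in range(11)]

# Someone who is honestly 70% sure does best by claiming certainty.
print(best_report(0.7, candidates))
# Someone who is honestly 30% sure does best by claiming the answer is surely wrong.
print(best_report(0.3, candidates))
```

Under this assumed rule the expected payoff is a straight line in the reported probability, so the best strategy is always to report 0 or 1, which is the "all-or-nothing" behavior the paragraph describes.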
For answers that are correct, you gain a moderate number of points in this special system if you assign a high probability of being right; the payoff drops gradually toward zero as you assign probabilities closer and closer to 0.5 to correct answers (that is, the expression of true uncertainty is neither rewarded nor penalized). The payoff rapidly becomes very negative (points subtracted from the score) as the probabilities you assign to correct answers drop toward zero. These numerical payoffs are calculated from equations technically known as "reproducing scoring functions." Sometimes also referred to as "scoring functions that encourage honesty," their real punch lies in the fact that they allow you to maximize your score in the long run if (and only if) the confidence you have in your knowledge is perfectly calibrated to the correctness of that knowledge; that is, you are right 50% of the time for the group of answers to which you assigned a probability of 0.5 of being correct, correct 90% of the time for those you assigned a probability of 0.9, etc.
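The article does not give the scoring equations themselves. As a rough sketch, one well-known rule with the shape described here is the logarithmic rule, score = 1 + log2(p), where p is the probability placed on what turned out to be true. Whether this matches the exact functions in the cited report is an assumption; the function names below are hypothetical.

```python
import math


def log_score(reported_prob, is_correct):
    """Assumed logarithmic payoff: 1 + log2(probability placed on the truth).

    Gives +1 at full confidence in the truth, 0 at a report of 0.5,
    and increasingly large negative values as the probability placed
    on the truth approaches zero.
    """
    p = reported_prob if is_correct else 1 - reported_prob
    if p <= 0:
        return -math.inf
    return 1 + math.log2(p)


def expected_score(true_prob, reported_prob):
    """Long-run average score when the answer is right true_prob of the time."""
    return (true_prob * log_score(reported_prob, True)
            + (1 - true_prob) * log_score(reported_prob, False))


def best_report(true_prob, candidates):
    """Return the reported probability with the highest expected score."""
    return max(candidates, key=lambda r: expected_score(true_prob, r))


# Reported probabilities from 0.05 to 0.95 in steps of 0.05.
candidates = [i / 20 for i in range(1, 20)]

print(log_score(0.5, True))          # neither rewarded nor penalized
print(log_score(0.1, True))          # strongly negative
print(best_report(0.7, candidates))  # the honest report scores best
```

Under this assumed rule, someone who is right 70% of the time earns the highest long-run score by reporting 0.7, not by rounding up to certainty, which is the "encourages honesty" property the paragraph attributes to reproducing scoring functions.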
Getting an accurate picture
This form of confidence testing can be implemented using pencil and paper (for MCQs with up to three answers, only one of which is correct); using this approach, it is possible to distinguish the separate contributions of lack of knowledge vs. inappropriate confidence as the reason for less than optimal test scores (3). Importantly, there is some evidence that the statistical reliability of test scores increases when scores are corrected for inappropriate confidence. It is also possible, using extensions of this technique, to assess whether a test taker is overconfident (for example, the kind of person who turns out to be correct only 20% of the time in the group of answers to which he or she assigned a 0.9 probability of being correct) or under-confident (the person whose group of answers assigned a probability of, say, 0.3 turn out to be correct 70% of the time), and by how much.
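The over- and under-confidence check described above amounts to grouping answers by the probability assigned and comparing that figure with the fraction actually correct. The sketch below uses made-up response data that mirrors the two examples in the text; the function names and the data are illustrative assumptions, not results from any study.

```python
from collections import defaultdict


def calibration_table(responses):
    """Map each assigned probability to the observed fraction correct.

    responses: iterable of (assigned_prob, was_correct) pairs.
    """
    groups = defaultdict(list)
    for assigned_prob, was_correct in responses:
        groups[assigned_prob].append(was_correct)
    return {prob: sum(outcomes) / len(outcomes)
            for prob, outcomes in sorted(groups.items())}


def confidence_gaps(table):
    """Assigned minus observed: positive means overconfident, negative under-confident."""
    return {prob: prob - observed for prob, observed in table.items()}


# Hypothetical test taker: ten answers rated 0.9 (two correct)
# and ten answers rated 0.3 (seven correct).
responses = (
    [(0.9, True)] * 2 + [(0.9, False)] * 8
    + [(0.3, True)] * 7 + [(0.3, False)] * 3
)

table = calibration_table(responses)
gaps = confidence_gaps(table)
for prob in table:
    print(f"assigned {prob:.1f}  observed {table[prob]:.1f}  gap {gaps[prob]:+.1f}")
```

A perfectly calibrated test taker would show a gap near zero in every group; the size and sign of each gap indicate by how much, and in which direction, confidence is off.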
Measured in this way, even second-year medical students in one study substantially overvalued the correctness of their answers, perhaps not surprising in view of their extensive experience with "right-or-wrong" MCQs. Even more interestingly, by exposure to the technique of confidence testing, these same students apparently learned not to overvalue the correctness of their knowledge or, stated differently, became better calibrated with respect to confidence in their knowledge (3).
Any way you measure it, miscalibrated confidence may be at least as much a problem in clinical medicine as lack of knowledge per se. And while it may be less obvious than overconfidence, inappropriate lack of confidence appears to be a particularly serious stumbling block for physicians, contributing to excessive testing and use of consultants or, at the extreme, the inability to make decisions at all. At the same time, physicians are understandably also very much concerned that under-confidence, expressed to patients as lack of certainty or even the appearance of it, can undermine patients' trust. Such a lack of patient trust, physicians feel, may in turn truly compromise their abilities to treat. It seems equally obvious, on the other hand, that overconfident physicians may jump to inappropriate diagnostic and therapeutic conclusions, misapply their knowledge, fail to learn when they need to learn, and lose patients' trust in a variety of other important ways (2).
If it is, in fact, "the 'vital office' of scientific medicine to develop systems of thought and action that will permit physicians to account more fully for both the certainties and uncertainties that shape their practices" (2), then scientific medicine needs to reconsider the unspoken effects of the right-or-wrong mentality on our ability to deal with those certainties and uncertainties. It also seems obvious that effective techniques for recognizing, assessing and calibrating physician confidence, through confidence testing or other such measures, may be among the more important "systems of thought and action" we need to develop in order to carry out this "vital office."
Frank Davidoff is ACP's Senior Vice President for Education.
1. Diamond GA, Forrester JS. Metadiagnosis. An epistemologic model of clinical judgment. Am J Med. 1983; 75:129-137.
2. Katz J. Acknowledging uncertainty: The confrontation of knowledge and ignorance. In "The Silent World of Doctor and Patient." New York, Free Press, 1984, pp. 165-206.
3. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994; 121:200-206.
4. Rippey RM, Voytovich AE. Linking knowledge, realism and diagnostic reasoning by computer-assisted confidence testing. J Computer Based Instruction. 1983; 9:88-97.
5. Brown TA, Shuford EH. Quantifying uncertainty into numerical probabilities for the reporting of intelligence. Defense Advanced Research Projects Agency Report R-1185-ARPA. Santa Monica, Calif., Rand Corp., 1973.