Statistics Commentary Series: Commentary No. 21

Excerpt

You administer the Structured Clinical Interview for DSM-5 (SCID-5)1 to a patient, and the results indicate the presence of a generalized anxiety disorder (GAD). How likely is it that the person actually has a GAD? The issue is that the SCID-5, like all tests – interviews, paper-and-pencil questionnaires, lab results, and everything else we use to diagnose or classify people – can misclassify people in 2 ways: labelling healthy people as having the condition, and labelling affected people as healthy. So, should this person be referred for therapy or entered into a study, or is this an instance of misclassification? As we'll see in this commentary, the answer depends on both the psychometric properties of the scale and the population from which the person came.
To understand the issues, we first have to look at how tests are validated. We begin by assembling 2 groups of people, 1 group that has the disorder the new test is designed to detect and another that does not. This assumes there is some gold standard, such as a clinician's judgment or some other test, which may not be golden at all but is the best we have available, so we have to live with it. Let's assume we administer the new test to 50 people in each group and display the results in a 2 × 2 table, as in Table 1. (It is rumored that if you ask a statistician any question, he or she will either draw a normal curve or make a 2 × 2 table; we've done the latter here. The normal curve will appear in the next paper, thus proving the rumor to be true.) Now let's define some terms.
The sensitivity of a test is its ability to detect the condition when it is present, so in this table:
Sensitivity = A / (A + C) = 40 / 50 = 0.80
The letters refer to the cells in the table, so that Cell A shows the number of people with the disorder whom the test correctly identifies; Cell B those without the disorder, incorrectly identified by the test; Cell C those with the disorder who are missed by the test; and Cell D the non-cases correctly ruled out by the test. That is, this test accurately picks up 80% of the cases. Conversely, the specificity of a test refers to its ability to rule out the condition when it is absent, so:
Specificity = D / (B + D) = 45 / 50 = 0.90
In other words, the test can accurately detect 90% of the non-cases. So far, the results look promising. In fact, this fictitious test is comparable to many diagnostic tests that are widely used, such as screening mammography in the general population (sensitivity = 0.69; specificity = 0.94),2 or the Mini-Mental State to detect dementia and delirium among hospital patients (sensitivity = 0.87; specificity = 0.82).3
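The arithmetic behind these 2 indices can be sketched in a few lines of Python; the cell counts below are the ones implied by 50 people in each group together with the sensitivity and specificity quoted above.

```python
# Cells of the 2 x 2 validation table (50 true cases, 50 non-cases),
# with counts implied by sensitivity = 0.80 and specificity = 0.90.
A = 40  # cases correctly identified by the test (true positives)
B = 5   # non-cases incorrectly flagged by the test (false positives)
C = 10  # cases missed by the test (false negatives)
D = 45  # non-cases correctly ruled out (true negatives)

sensitivity = A / (A + C)  # ability to detect the condition when present
specificity = D / (B + D)  # ability to rule it out when absent

print(sensitivity)  # 0.8
print(specificity)  # 0.9
```

Reading the formulas off the table this way makes it clear that sensitivity is computed only among the true cases (column A + C), and specificity only among the non-cases (column B + D).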
However, once the new test is accepted and used in practice, clinicians or researchers will no longer also administer the gold standard; after all, the new test was likely developed so that the presumably more expensive or more time-consuming standard could be replaced. All they will have are the results of the new test, and so the nature of the questions changes.
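To preview why the population matters when only the test result is in hand, here is a small Bayes-rule sketch. The 50/50 validation sample corresponds to a prevalence of 0.50; the 5% community prevalence used for contrast is an illustrative assumption, not a figure from the text.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive test result reflects a true case,
    computed with Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# In the 50/50 validation sample, prevalence is 0.50:
print(f"{positive_predictive_value(0.80, 0.90, 0.50):.2f}")  # 0.89
# With an assumed community prevalence of 5%, the same test looks far weaker:
print(f"{positive_predictive_value(0.80, 0.90, 0.05):.2f}")  # 0.30
```

The sensitivity and specificity are unchanged between the two calls; only the prevalence differs, yet the probability that a positive result is a true case drops sharply in the low-prevalence setting.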