Legacy Scales and Patient-Reported Outcomes: A Case for Embracing Complexity
However, it also stems from our communal failure to insist on better alternatives. The paper by Andersen et al. exemplifies the struggle of dealing with the legacy of outcomes reporting in orthopaedics. The excellent/good/fair/poor paradigm that dominated the literature well into recent memory was gradually replaced by well-intentioned attempts to distill the essential “goodness” of a procedure into a single value encompassing elements of objective measurement, surgeon impression, and patient-reported function. These ratings were initially modeled from the Iowa Hip Score devised by Carroll Larson in 1963¹ and the similarly constructed hip scoring system of William Harris in 1969². Other clinicians and professional societies followed suit. The American Orthopaedic Foot & Ankle Society (AOFAS) scores used as the primary outcome variable in the current paper offer a case study³. Developed by an expert panel, the scores comprise small discrete choices within disparate categories of measurement, surgeon evaluation, and patient report. The mathematical construction of the scores alone creates a tendency to crowd the composite outcomes toward the ends of the scale if any interactions between the subcategories are present. Almost a quarter-century after it was developed, the scoring system has been neither validated nor shown to generate normally distributed data⁴,⁵. This was, in fact, borne out by the study by Andersen et al., as the data proved to be skewed when normality testing was applied.
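The ceiling effect described above can be made concrete with a small sketch. The scores below are hypothetical, not the study's data; they simply mimic a composite outcome that piles up against the top of a 0-to-100 scale, and the Fisher-Pearson sample skewness statistic shows the resulting left tail:

```python
# Hypothetical composite scores (not the study's data): most patients
# cluster near the ceiling of a 0-100 scale, with a few poor outcomes.
scores = [100, 100, 97, 95, 95, 92, 90, 88, 85, 80, 72, 60, 45]

n = len(scores)
mean = sum(scores) / n
# Sample standard deviation
sd = (sum((x - mean) ** 2 for x in scores) / (n - 1)) ** 0.5
# Adjusted Fisher-Pearson sample skewness; a negative value indicates
# a left tail, the pattern expected when scores crowd a scale's ceiling.
skew = (n / ((n - 1) * (n - 2))) * sum(((x - mean) / sd) ** 3 for x in scores)
print(f"sample skewness = {skew:.2f}")  # clearly negative, i.e. skewed
```

A distribution like this would be expected to fail a formal normality test, which is why parametric summaries of such composite scores can mislead.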
These mixed-data systems must be abandoned. Even those with a less problematic construction than the AOFAS score create a mixture of patient-reported data, physical measurements, and surgeon impressions that combine to make an unwieldy and unreliable product. The argument deep in the discussion sections of many clinical papers is that legacy scoring systems must be included so that the results can be compared with other legacy papers. This is fallacious; modern bad data are no less bad than their historical counterparts.
To its credit, the current paper uses a combination of a mixed-data score, a disease-specific patient-reported outcome measure, a general health status outcome measure, and objective data. Although there are consistent differences between the groups, their magnitudes are relatively modest. A legitimate point of concern is that the only catastrophic outcome in the series, an ankle fusion, occurred in the suture-button group. How this one potentially very negative data point was censored could affect a range of the statistical outcomes. It is not a weakness of the paper, but rather a subtle choice of analysis that requires judgment. Whether one agrees with the authors’ choices or not, their thorough data reporting enables the discussion. The study is more interpretable not because of the authors’ choice of psychometric variables but in spite of them.
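The leverage a single catastrophic outcome exerts is easy to demonstrate. The numbers below are hypothetical, not drawn from the study; they show how censoring one floor-level data point shifts a group's mean score:

```python
# Hypothetical treatment-group scores (not the study's data), including
# one catastrophic outcome scored at the floor of the scale.
group = [92, 90, 88, 87, 85, 84, 82, 80, 78, 15]

mean_all = sum(group) / len(group)             # catastrophic case included
censored = [x for x in group if x != 15]
mean_censored = sum(censored) / len(censored)  # catastrophic case censored
print(round(mean_all, 1), round(mean_censored, 1))  # → 78.1 85.1
```

A seven-point swing in the group mean from one analytic choice illustrates why transparent reporting of how such cases were handled matters more than the choice itself.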
The failure of the legacy methods for evaluating outcome lies not in the inclusion of surgeons’ judgments or objective measurements, but in the reduction of multiple sources of data into one imperfect variable. Reality is not a pendulum swinging between surgical impression and patient report. Our studies are most robust when they embrace the complexity of the surgical event. The patient’s story is one part of that complexity. It is not the only one.