Comparison of Binary Predictive Scoring Systems of Posthepatectomy Liver Failure
Although external validation of predictive clinical scoring systems is essential, it is infrequently performed. Skrzypczyk and colleagues1 should be congratulated for undertaking a comparison of posthepatectomy liver failure (PHLF) definitions and their utility in predicting clinical outcomes. The authors conclude that the International Study Group of Liver Surgery (ISGLS) PHLF definition was “less discriminatory than the ‘50–50’ and ‘PeakBili >7’ criteria in identifying patients at risk of posthepatectomy major complications or death,” but this does not appear to be supported by the data presented.
The predictive accuracy of binary diagnostic tests can be compared in a number of ways. The Cochrane Collaboration are currently producing guidelines for comparison and meta-analysis of studies of diagnostic accuracy.2 A common initial strategy is to compare the discordance in sensitivity and specificity (using McNemar's test), though concerns have been raised about this approach and modifications suggested.3 Skrzypczyk et al show that the ISGLS definition is significantly more sensitive than the 50–50 and PeakBili >7 criteria, but less specific for predicting major morbidity.
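As a minimal sketch of the paired comparison described above, an exact McNemar test can be computed from the two discordant counts alone. The function name and the counts in the usage note are hypothetical and are not taken from the study. To compare sensitivity, the counts would be restricted to patients who had the outcome; to compare specificity, to those who did not.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: patients flagged by definition A but not by definition B
    c: patients flagged by definition B but not by definition A
    Concordant pairs carry no information and are not needed.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Under the null hypothesis each discordant pair is equally likely to
    # favour either definition, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, `mcnemar_exact(2, 12)` tests whether 2 versus 12 discordant classifications (illustrative numbers only) is compatible with equal sensitivity. The exact binomial form is used here rather than the chi-square approximation because discordant counts in single-centre cohorts are often small.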
An alternative approach is to determine overall predictive performance, for instance, by comparing diagnostic odds ratios or likelihood ratios.4 The authors take a similar approach but do not report a statistical comparison of results. We have compared the reported odds ratios and confidence intervals between the 3 scoring systems for predicting morbidity or mortality. Although a numerical difference between scores exists, it does not reach conventional thresholds of statistical significance (Fig. 1).
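One way such a comparison can be made from published figures alone is sketched below. It is an illustration under stated assumptions, not necessarily the method behind Fig. 1. It recovers the standard error of each log odds ratio from its reported 95% confidence interval and applies a z-test to the difference. The function names and the example values are hypothetical. The test treats the two estimates as independent, which is only an approximation when both definitions are applied to the same patients.

```python
from math import erfc, log, sqrt

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_or_and_se(or_point: float, ci_low: float, ci_high: float):
    """Return log(OR) and its standard error, recovered from a 95% CI."""
    se = (log(ci_high) - log(ci_low)) / (2 * Z_95)
    return log(or_point), se


def compare_odds_ratios(or_a, or_b):
    """z-test for a difference between two odds ratios.

    Each argument is a tuple (point estimate, lower 95% limit, upper 95% limit).
    ASSUMPTION: the two estimates are independent. Returns (z, two-sided p).
    """
    log_a, se_a = log_or_and_se(*or_a)
    log_b, se_b = log_or_and_se(*or_b)
    z = (log_a - log_b) / sqrt(se_a ** 2 + se_b ** 2)
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Hypothetical odds ratios, not values reported in the study.
z, p = compare_odds_ratios((4.0, 2.0, 8.0), (1.0, 0.5, 2.0))
```

Overlapping confidence intervals do not by themselves establish the absence of a difference, which is why the difference is tested directly on the log scale.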
Maximizing both sensitivity and specificity improves overall predictive performance, but weighting the two equally is not always desirable. In PHLF, it may be useful to increase the true detection rate (sensitivity) at the expense of more false positives (lower specificity). This is not a population-screening tool, where the minimization of false positives is important to avoid unnecessary anxiety, diagnostic tests, and invasive procedures. We argue that a patient termed as having PHLF who does not go on to have a complication or die suffers few negative consequences from having been labeled as such. Yet, there may be much to be gained by increasing the “pick-up rate” of patients at risk of poor outcomes.
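The trade-off argued for here can be made concrete with a simple weighted misclassification cost, sketched below. The function, the weight, and every number in the example are hypothetical and chosen only to illustrate the reasoning. They do not describe the performance of any of the 3 definitions.

```python
def weighted_misclassification_cost(
    sensitivity: float,
    specificity: float,
    prevalence: float,
    fn_weight: float,
) -> float:
    """Expected cost per patient, in units of one false positive.

    fn_weight is how many times worse a missed case (false negative)
    is judged to be than a false alarm (false positive).
    """
    false_negative_rate = prevalence * (1 - sensitivity)
    false_positive_rate = (1 - prevalence) * (1 - specificity)
    return fn_weight * false_negative_rate + false_positive_rate


# Hypothetical tests: one more sensitive, one more specific.
prevalence = 0.2
sensitive_test = dict(sensitivity=0.8, specificity=0.6)
specific_test = dict(sensitivity=0.4, specificity=0.9)

for fn_weight in (1, 10):
    cost_sensitive = weighted_misclassification_cost(
        prevalence=prevalence, fn_weight=fn_weight, **sensitive_test
    )
    cost_specific = weighted_misclassification_cost(
        prevalence=prevalence, fn_weight=fn_weight, **specific_test
    )
    print(fn_weight, round(cost_sensitive, 2), round(cost_specific, 2))
```

With equal weights the more specific test has the lower cost in this example. Once a missed case is judged 10 times worse than a false alarm, the more sensitive test is preferred.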
We agree with the authors’ assertion that in this cohort all 3 definitions were poor at identifying patients who go on to have a complication or die: the association between these blood markers and outcome was not strong. We were less clear, however, as to why the 3-point grading system associated with the ISGLS score had not been used in the assessment of test performance.5
Determining the accuracy and utility of predictive scores is important, and studies should be described using standardized reporting methods.6 An alternative conclusion from this study is that there is weak evidence for any difference in test performance, but that weighting true detection rate over the avoidance of false positives may be more appropriate for this condition.