An Appraisal of the Carlisle-Stouffer-Fisher Method for Assessing Study Data Integrity and Fraud
Data fabrication and scientific misconduct have been recently uncovered in the anesthesia literature, partly via the work of John Carlisle. In a recent article in Anaesthesia, Carlisle analyzed 5087 randomized clinical trials from anesthesia and general medicine journals from 2000 to 2015. He concluded that in about 6% of studies, data comparing randomized groups on baseline variables, before the given intervention, were either too similar or dissimilar compared to that expected by usual sampling variability under the null hypothesis. Carlisle used the Stouffer-Fisher method of combining P values in Table 1 (the conventional table reporting baseline patient characteristics) for each study, then calculated trial P values and assessed whether they followed a uniform distribution across studies. Extreme P values targeted studies as likely to contain data fabrication or errors. In this Statistical Grand Rounds article, we explain Carlisle’s methods, highlight perceived limitations of the proposed approach, and offer recommendations. Our main findings are (1) independence was assumed between variables in a study, which is often false and would lead to “false positive” findings; (2) an “unusual” result from a trial cannot easily be concluded to represent fraud; (3) utilized cutoff values for determining extreme P values were arbitrary; (4) trials were analyzed as if simple randomization was used, introducing bias; (5) not all P values can be accurately generated from summary statistics in a Table 1, sometimes giving incorrect conclusions; (6) small numbers of P values to assess outlier status within studies is not reliable; (7) utilized method to assess deviations from expected distributions may stack the deck; (8) P values across trials assumed to be independent; (9) P value variability not accounted for; and (10) more detailed methods needed to understand exactly what was done. It is not yet known to what extent these concerns affect the accuracy of Carlisle’s results. We recommend that Carlisle’s methods be improved before widespread use (applying them to every manuscript submitted for publication). Furthermore, lack of data integrity and fraud should ideally be assessed using multiple simultaneous statistical methods to yield more confident results. More sophisticated methods are needed for nonrandomized trials, randomized trial data reported beyond Table 1, and combating growing fraudster sophistication. We encourage all authors to more carefully scrutinize their own reporting. Finally, we believe that reporting of suspected data fraud and integrity issues should be done more discretely and directly by the involved journal to protect honest authors from the stigma of being associated with potential fraud.