The effects of bias (over- and underestimates) in estimates of disease severity on hypothesis testing using different assessment methods were explored. Nearest percentage estimates (NPEs), the Horsfall–Barratt (H-B) scale, and two linear category scales (10% increments, with and without additional grades at low severity) were compared using simulation modelling. Type I and type II error rates were used to compare two treatment differences. The H-B scale and the unamended 10% scale had the lowest power to correctly reject a false null hypothesis, and the effects of rater bias on type II errors were greater over specific severity ranges. Apart from NPEs, the amended 10% category scale was most often superior to the other methods at all severities tested for reducing the risk of type II errors. It should thus be a preferred method for raters who must use a category scale for disease assessments. Rater bias and assessment method had little effect on type I error rates. The power of the hypothesis test using unbiased estimates was most often greater than that using biased estimates, regardless of assessment method. An unanticipated observation was that rater bias had a greater impact than assessment method on type II errors. Knowledge of the effects of rater bias and scale type on hypothesis testing can be used to improve the accuracy and reliability of disease severity estimates, and can provide a logical framework for improving aids for estimating severity visually, including standard area diagrams and rater training software.
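The kind of comparison described above can be illustrated with a minimal simulation sketch. It is not the authors' model. The function names (`to_midpoint`, `type2_rate`), the additive bias term, the normal distributions, the sample size, and the low-severity cut points of the amended scale are all assumptions chosen for illustration, and the test is a large-sample z approximation rather than whatever test the study used. Category scales are represented by interval midpoints, using the commonly cited H-B grade boundaries.

```python
import bisect
import random
import statistics

# Grade boundaries in % severity. The H-B boundaries are the commonly cited ones;
# the extra low-severity cut points in AMENDED_10 are illustrative assumptions.
HB_EDGES = [0, 3, 6, 12, 25, 50, 75, 88, 94, 97, 100]
LINEAR_10 = list(range(0, 101, 10))
AMENDED_10 = [0, 0.1, 0.5, 1, 2, 5] + list(range(10, 101, 10))


def to_midpoint(severity, edges):
    """Record a percentage estimate as the midpoint of its scale interval."""
    if severity <= 0:
        return 0.0
    if severity >= 100:
        return 100.0
    i = bisect.bisect_left(edges, severity)  # edges[i-1] < severity <= edges[i]
    return (edges[i - 1] + edges[i]) / 2


def type2_rate(mu_a, mu_b, edges=None, bias=0.0, n=30, reps=2000, seed=1):
    """Estimate the type II error rate for two treatments with different true means.

    edges=None means nearest percentage estimates (no category scale).
    bias is a hypothetical constant over- or underestimate in percentage points.
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        groups = []
        for mu in (mu_a, mu_b):
            values = []
            for _ in range(n):
                true_sev = min(100.0, max(0.0, rng.gauss(mu, 5.0)))
                # Rater estimate: true value plus bias plus random rater error (assumed).
                est = min(100.0, max(0.0, true_sev + bias + rng.gauss(0.0, 3.0)))
                values.append(to_midpoint(est, edges) if edges else est)
            groups.append(values)
        a, b = groups
        diff = statistics.mean(a) - statistics.mean(b)
        se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
        if se == 0:
            rejected = diff != 0
        else:
            rejected = abs(diff / se) > 1.96  # two-sided z test, alpha = 0.05
        if not rejected:
            misses += 1
    return misses / reps


if __name__ == "__main__":
    for name, edges in [("NPE", None), ("H-B", HB_EDGES),
                        ("10%", LINEAR_10), ("amended 10%", AMENDED_10)]:
        print(name, type2_rate(30, 33, edges=edges, bias=5.0, reps=500))
```

Setting both means equal would turn the same routine into an estimate of one minus the type I error rate, which is one way such a sketch could be extended.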